Machine learning to predict numerical outcomes in a matrix-defined problem space

ABSTRACT

Systems and methods for predicting feature values in a matrix are disclosed. In example embodiments, a server accesses a matrix, the matrix having multiple dimensions, one dimension of the matrix representing features, and one dimension of the matrix representing entities. The server separates the matrix into multiple submatrices along a first dimension, each submatrix including all cells in the matrix for a set of values in the first dimension. The server provides the multiple submatrices to multiple machines. The server computes, using each machine, a correlation between values in at least one second dimension of the matrix and a value for a preselected feature in the matrix, the correlation being used to predict the value for the preselected feature based on other values along the at least one second dimension. The server provides an output representing the computed correlation.

TECHNICAL FIELD

The present disclosure generally relates to machines configured for learning to predict numerical outcomes in a matrix-defined problem space, including computerized variants of such special-purpose machines and improvements to such variants, and to the technologies by which such special-purpose machines become improved compared to other special-purpose machines that predict numerical outcomes. In particular, the present disclosure addresses systems and methods for implementing machine learning to predict numerical outcomes in a matrix-defined problem space.

BACKGROUND

Predicting numerical outcomes in a matrix-defined problem space may be desirable. For example, a matrix may store a set of past employers for multiple employees. It may be desirable to use this matrix to predict a future employer of one or more of the employees.

BRIEF DESCRIPTION OF THE DRAWINGS

Some embodiments of the technology are illustrated, by way of example and not limitation, in the figures of the accompanying drawings.

FIG. 1 illustrates an example system for predicting numerical outcomes in a matrix-defined problem space, in accordance with some embodiments.

FIG. 2 is a diagram illustrating an example of model parameters for predicting feature values in a matrix, in accordance with some embodiments.

FIG. 3 is a data flow diagram for predicting feature values in a matrix, in accordance with some embodiments.

FIG. 4 illustrates a slice of the three-dimensional matrix, in accordance with some embodiments.

FIG. 5 is a flow chart of a method for training a machine, in accordance with some embodiments.

FIG. 6 is a schematic diagram of a technique for associating a title with at least one area of expertise, in accordance with some embodiments.

FIG. 7 is a schematic diagram of a technique for predicting values in a matrix, in accordance with some embodiments.

FIG. 8 is a flow chart of a method for training a machine and predicting values in a matrix, in accordance with some embodiments.

FIG. 9 is a flow chart of a method for ranking job candidates, in accordance with some embodiments.

FIG. 10 is a flow chart of a method for negative sampling, in accordance with some embodiments.

FIG. 11 is a schematic diagram of matrices that may be used in negative sampling, in accordance with some embodiments.

FIG. 12 is a block diagram illustrating components of a machine able to read instructions from a machine-readable medium and perform any of the methodologies discussed herein, in accordance with some embodiments.

DETAILED DESCRIPTION

Overview

The present disclosure describes, among other things, methods, systems, and computer program products that individually provide various functionality. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various aspects of different embodiments of the present disclosure. It is evident, however, to one skilled in the art, that the present disclosure may be practiced without all of the specific details.

Some aspects of the technology described herein relate to predicting numerical outcomes in a matrix-defined problem space. Numerical outcomes may include any outcomes that may be expressed numerically, such as Boolean outcomes, integer outcomes, or any other outcomes that are capable of being expressed with number(s). In some implementations, a control server accesses a matrix of examples, the matrix having multiple dimensions, one dimension of the matrix representing features, and one dimension of the matrix representing data points, which may correspond to any examples. In one innovative example, the data points correspond to examples of employees having the features of past or present employers. The control server separates the matrix into multiple submatrices along a first dimension, each submatrix including all cells in the matrix for a set of values in the first dimension. The control server provides the multiple submatrices to multiple computation servers, each computation server being provided with a single submatrix. Each computation server computes a correlation between values in at least one second dimension of the matrix and a value for a preselected feature in the matrix. The correlation is used to predict the value for the preselected feature based on other values along the at least one second dimension. Each computation server provides an output representing the computed correlation.

Factorization machines (FMs) and their extension, field-aware factorization machines (FFMs), have a broad range of applications for machine learning tasks including regression, classification, collaborative filtering, search ranking, and recommendation. In this document, a scalable implementation of the FFM learning model that runs on a standard Spark/Hadoop cluster is presented, among other things. One contribution, among others, includes a prediction algorithm that runs in linear time for models with higher order interactions of rank three or greater. Some aspects further describe the basic components of the FFM, including feature engineering and negative sampling, sparse matrix transposition, a training data and model co-partitioning strategy, and a computational graph-inspired training algorithm.

One distributed training algorithm and system optimizations enable some aspects to train FFM models at high speed and scale on commodity hardware using off-the-shelf big data processing frameworks such as Hadoop and Spark.

Some aspects solve the problem of correlating values in matrices. For example, information about a user of a professional network may be stored in a matrix, which may be used to recommend next professional moves (e.g., a new job) for the user. This problem is solved at a server. The server accesses a matrix, the matrix having multiple dimensions, one dimension of the matrix representing features, and one dimension of the matrix representing data points. The server separates the matrix into multiple submatrices along a first dimension, each submatrix including all cells in the matrix for a set of values in the first dimension. The server provides the multiple submatrices to multiple machines. The server computes, using each machine, a correlation between values in at least one second dimension of the matrix and a value for a preselected feature in the matrix, the correlation being used to predict the value for the preselected feature based on other values along the at least one second dimension. Some advantages and improvements include more efficient distributed (among multiple machines) computation of correlation of values in matrices. By distributing the submatrices among multiple machines and having each machine generate its own correlation, some aspects improve (e.g., increase) the speed at which the server is able to generate the output. The server would generate the output much more slowly if the server were to process the entire matrix without the assistance of the machines.

Some aspects solve the problem of correlating values in matrices. For example, information about a user of a professional network may be stored in a matrix, which may be used to recommend next professional moves (e.g., a new job) for the user. This problem is solved at a control server. The control server accesses a matrix. The matrix has feature columns and example rows. The control server shards the matrix by features and by examples to generate feature shards and example shards, respectively. The control server distributes the feature shards and the example shards among a plurality of computation servers, each computation server from the plurality including an inner component storing at least one feature shard and an outer component storing at least one example shard. The control server receives a first correlation correlating data in an inner component of at least a first computation server with data in the outer components of the plurality of computation servers, the first correlation generated by running an inner pass. The control server receives a second correlation correlating data in an outer component of at least a second computation server with data in the inner components of the plurality of computation servers, the second correlation generated by running an outer pass. The control server stores the first correlation and the second correlation at the control server. The control server provides an output associated with at least the first correlation and the second correlation. Some advantages and improvements include more efficient distributed (among multiple computation servers and the control server) computation of correlation of values in matrices. By distributing the submatrices among multiple computation servers and having each computation server generate its own correlation, some aspects improve (e.g., increase) the speed at which the control server is able to generate the output. The control server would generate the output much more slowly if the control server were to process the entire matrix without the assistance of the computation servers.

Some aspects solve the problem of ranking job candidates when multiple candidates are available for an opening at a business. A server receives, from a client device, a request for job candidates for an employment position, the request comprising search criteria. The server generates, based on the request, a set of job candidates for the employment position. The server provides, to the client device, a prompt for ranking the set of job candidates. The server receives, from the client device, a response to the prompt. The server ranks the set of job candidates based on the received response. The server provides, for display at the client device, an output based on the ranked set of job candidates. Some advantages and improvements include the ability to rank job candidates based on feedback from the client device and the provision of prompts for this feedback.

In machine learning based on real-world information, there are many positive examples, but few negative examples. For example, a professional network may store information about candidates who recently received jobs at Company X, but may not store information about candidates who were rejected by Company X. Some aspects solve the problem of generating negative examples (e.g., generating examples of people who are not good fits for Company X). One solution assumes that, while the people who got jobs at Company X are good fits, most other people are not good fits (as good fits for Company X are rare among the users of the professional network). A server accesses a matrix representing users of the professional network. The matrix has rows representing entities (e.g., people) and columns representing features (e.g., works at Company X). The server selects a specific column in the matrix for randomization. The server partitions the matrix by row into multiple submatrices. The server assigns the multiple submatrices to multiple machines, each submatrix being assigned to one machine and each machine being assigned to one submatrix. The server receives, from each machine, a shuffled submatrix generated at the machine by shuffling the values in the specific column among the rows of the submatrix assigned to the machine. The server merges the shuffled submatrices into a shuffled matrix. The server provides an output representing the shuffled matrix. Some advantages and improvements include the ability to generate negative samples using the matrix.
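By way of illustration only, the column-shuffling approach to negative sampling described above may be sketched in Python as follows. The function name, the toy matrix, and the use of NumPy are assumptions made for this sketch and are not part of the disclosure; each row partition stands in for the submatrix assigned to one machine, and the loop runs sequentially rather than on separate machines.

```python
import numpy as np

def shuffle_column_by_partition(matrix, column, num_machines, seed=0):
    """Shuffle one column within each row partition, then merge the partitions.

    Each partition stands in for the submatrix assigned to one machine.
    """
    rng = np.random.default_rng(seed)
    # Partition the row indices; array_split tolerates uneven partition sizes.
    row_groups = np.array_split(np.arange(matrix.shape[0]), num_machines)
    shuffled_parts = []
    for rows in row_groups:
        part = matrix[rows].copy()                           # submatrix for one machine
        part[:, column] = rng.permutation(part[:, column])   # shuffle only the chosen column
        shuffled_parts.append(part)
    return np.vstack(shuffled_parts)                         # merge into the shuffled matrix

# Hypothetical toy data: 6 users (rows) x 3 binary features (columns).
# Column 2 plays the role of "works at Company X".
users = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])
negatives = shuffle_column_by_partition(users, column=2, num_machines=2)
```

In this sketch a shuffled row may happen to coincide with an original row; how such coincidences are handled is outside the scope of the sketch.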

FIG. 1 illustrates an example system 100 for predicting numerical outcomes in a matrix-defined problem space, in accordance with some embodiments. As shown, the system includes a control server 110, computation servers 120, and a client device 130 connected to one another via a network 140. While a single control server 110 and a single client device 130 are illustrated, the technology may be implemented with multiple control servers or multiple client devices. Also, while three computation servers 120 are illustrated, there may be any number of computation servers 120. The network 140 may include one or more networks, such as the Internet, an intranet, a local area network, a wide area network, a wired network, a wireless network, a virtual private network, and the like. The client device 130 may include a laptop computer, a desktop computer, a mobile phone, a tablet, a smart watch, a smart television, a personal digital assistant, a digital music player, and the like.

According to some examples, the control server 110 stores (or is coupled with a data repository that stores) a matrix. The matrix has multiple dimensions. One of the dimensions represents features, such as employers, job titles, universities attended, and the like. One of the dimensions represents entities, such as individuals or employees. The control server 110 separates the matrix into multiple submatrices along a first dimension (e.g., features). Each submatrix includes all cells in the matrix for a set of values in the first dimension (e.g., all values for the employer “ABC Corporation”). The control server 110 provides the multiple submatrices to multiple computation servers 120.1-3. Each computation server 120.k (where k is a number between 1 and 3) is provided with a single submatrix. It should be noted that, while three computation servers 120 are illustrated, there may be any number of computation servers 120. Some implementations may use hundreds or thousands of computation servers 120.
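As a minimal sketch of the separation step described above, and assuming a dense NumPy array whose rows are entities and whose columns are features, the matrix may be split along the feature dimension so that each submatrix keeps every row for its set of feature columns. The function name and the toy data are hypothetical.

```python
import numpy as np

def split_by_feature(matrix, num_machines):
    """Split an (entities x features) matrix into column-wise submatrices.

    Each submatrix keeps every row (entity) for its subset of feature
    columns, so one submatrix can be handed to one computation server.
    """
    n_features = matrix.shape[1]
    # array_split tolerates a feature count that is not evenly divisible.
    column_groups = np.array_split(np.arange(n_features), num_machines)
    return [(cols, matrix[:, cols]) for cols in column_groups]

# Hypothetical toy data: 4 employees (rows) x 6 binary features (columns).
matrix = np.array([
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
])
shards = split_by_feature(matrix, num_machines=3)
```

Each entry of `shards` pairs the original column indices with the corresponding submatrix, so that results computed per submatrix can later be related back to the features of the full matrix.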

Each computation server 120.k computes a correlation between values in at least one second dimension of the matrix and a value for a preselected feature in the matrix. The correlation is used to predict the value for the preselected feature based on other values along the at least one second dimension. The computation servers 120 provide an output representing the computed correlation. The merged output may be used to make predictions, and representations of the predictions may be provided for display at the client device 130.

In some cases, the preselected feature may be “employment at DEF Insurance Company.” The computation server 120 may determine that features such as studying sales and marketing in college, having a Bachelor's degree, having worked at “GHI Insurance Company” or having worked at “JKL Insurance Company” are highly positively correlated with the feature of “employment at DEF Insurance Company.” Other features, such as having worked at “MNO Technologies,” might not be correlated or might be negatively correlated with “employment at DEF Insurance Company.”

Factorization machines (FMs) model all interactions between features using factorized parameters. Field-aware factorization machines (FFMs) are an extension of factorization machines that are used, for example, in recommender systems. FMs and FFMs may be used in machine learning tasks, such as regression, classification, collaborative filtering, search ranking, and recommendation. Some implementations of the technology described herein leverage FFMs to provide a prediction algorithm that runs in linear time. The training algorithm and system optimizations described herein enable training of the FFM models at high speed and at a large scale. Thus, FFMs based on large datasets and large parameter spaces may be trained.

In some embodiments, FMs are a model class that combines the advantages of Support Vector Machines (SVMs) with factorization models. Like SVMs, FMs are a general predictor working with any real-valued feature vector. In contrast to SVMs, FMs model all interactions between features using factorized parameters. Thus, they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail. FMs are a class of models which generalize Tensor Factorization models and polynomial kernel regression models. FMs uniquely combine the generality and expressive power of prior approaches (e.g., SVMs) with the ability to learn from sparse datasets using factorization (e.g., SVD++). FFMs extend FMs to also model effects of different interaction types, defined by fields.

For a large professional networking or employee-finding service, a production-level machine learning system may have a number of operational goals, such as scale. Large-scale distributed systems are needed to adequately process data of millions of professional network members.

The FFM model extends the FM model with information about groups (called fields) of features. The FFM model equation is shown in Equation 1. In Equation 1, ŷ represents the FFM value, x is the feature vector, w₀ is the global bias, w is the unary bias, and V is the tensor of interaction vectors. Together, w₀, w, and V are the parameters of the model. In Equation 1, x∈R^(m), w₀∈R, w∈R^(m), and V∈R^(r×s×m).

$\begin{matrix}{{\hat{y}\left( {{x;w_{0}},w,V} \right)} = {w_{0} + {\sum\limits_{j = 1}^{m}{w_{j}x_{j}}} + {\sum\limits_{j = 1}^{m}{\sum\limits_{k > j}^{m}{x_{j}x_{k}\left\langle {V_{{\alpha{(k)}}j},V_{{\alpha{(j)}}k}} \right\rangle}}}}} & {{Equation}\mspace{14mu} 1}\end{matrix}$

In some cases, the bulk synchronous parallel (BSP) model of computation is used for distributed computations (e.g., at the computation servers 120). The BSP model may include the following components: a set of processors b₁, . . . , b_(n), each with local memory (e.g., computation servers 120), a network that connects each processor to each other processor (e.g., network 140), and a synchronization device that synchronizes the processors (e.g., control server 110). To distribute objects among the processors (e.g., computation servers 120), sharding schemes may be used. A sharding scheme partitions data for distribution. In one specific example, the sharding scheme identifies feature partitions and assigns them to a computation server 120.k. Each distributed object may have at least one sharding scheme. For example, given w in R^(m), a sharding scheme h would allocate w_(i) in b_(h(i)) for i between 1 and m. Some partitioning schemes may uniformly distribute data among processors. Two distributed objects are said to be co-partitioned with respect to h if they are both sharded by h.
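The notion of a sharding scheme h and of co-partitioning may be illustrated with the following sketch, which is offered under stated assumptions: the modulo-based scheme, the function names, and the toy vectors are hypothetical, indices are 0-based rather than 1-based, and plain dictionaries stand in for the local memories of the processors.

```python
def modulo_sharding(num_processors):
    """Return a sharding scheme h(i) that maps index i to a processor index."""
    return lambda i: i % num_processors

def distribute(values, h, num_processors):
    """Place values[i] in the local memory of processor h(i)."""
    local_memory = [dict() for _ in range(num_processors)]
    for i, value in enumerate(values):
        local_memory[h(i)][i] = value
    return local_memory

h = modulo_sharding(num_processors=3)
w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]   # hypothetical bias vector
x = [1, 0, 1, 1, 0, 0, 1]                 # hypothetical feature vector

# Both objects are sharded by the same h, so they are co-partitioned:
# w[i] and x[i] always land on the same processor.
w_shards = distribute(w, h, 3)
x_shards = distribute(x, h, 3)
```

Because both vectors use the same scheme, a product such as w[i]·x[i] can be formed locally on processor h(i) without moving either operand over the network.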

In some examples, a dataset D may be used, where D includes a series of examples x_(i) and labels y_(i): D={(x_(i), y_(i))}_(i=1) ^(n), where x_(i)∈R^(m) and y_(i)∈R. In one implementation, there are n examples in the dataset, which are indexed using 1≤i≤n. In one implementation, each example has m features, which are indexed using 1≤j≤m. For each feature, there are r associated latent factors, which are indexed using 1≤ƒ≤r. Each feature belongs to exactly one field according to a field index mapping function: α: {1, . . . , m}→{1, . . . , s}. There are s distinct fields indexed by p and q. This mapping α defines a disjoint union of nonempty fields F_(p)⊆{1, . . . , m} for each 1≤p≤s. Some aspects further define m_(p)≡|F_(p)| as the cardinality of field p, which leads to Equation 2.

$\begin{matrix}{m = {\sum\limits_{p = 1}^{s}m_{p}}} & {{Equation}\mspace{14mu} 2}\end{matrix}$
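As a small worked illustration of the field index mapping α and of Equation 2, the following sketch builds the fields F_(p) from a hypothetical mapping and confirms that the field cardinalities sum to the number of features. The mapping values and function name are assumptions, and indices are 0-based.

```python
def fields_from_mapping(alpha, num_fields):
    """Group feature indices j by their field alpha[j], giving F_0, ..., F_{s-1}."""
    fields = [[] for _ in range(num_fields)]
    for j, p in enumerate(alpha):
        fields[p].append(j)
    return fields

# Hypothetical mapping of m = 6 features onto s = 3 fields.
alpha = [0, 0, 1, 1, 1, 2]
fields = fields_from_mapping(alpha, num_fields=3)
field_sizes = [len(F) for F in fields]   # m_p for each field p
```

Here `fields` is `[[0, 1], [2, 3, 4], [5]]`, every field is nonempty, and the sizes 2, 3, and 1 sum to m = 6.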

Equation 1 may be rewritten as Equation 3, where Equation 4, Equation 5, and Equation 6 apply.

$\begin{matrix}{{\hat{y}\left( x_{i} \right)} = {{{\hat{y}}_{bias}\left( x_{i} \right)} + {{\hat{y}}_{inter}\left( x_{i} \right)} + {{\hat{y}}_{intra}\left( x_{i} \right)}}} & {{Equation}\mspace{14mu} 3} \\{{{\hat{y}}_{bias}\left( x_{i} \right)} = {w_{0} + \left\langle {w,x_{i}} \right\rangle}} & {{Equation}\mspace{14mu} 4} \\{{{\hat{y}}_{inter}\left( x_{i} \right)} = {\sum\limits_{p = 1}^{s}{\sum\limits_{q > p}^{s}{\sum\limits_{j \in F_{p}}{\sum\limits_{k \in F_{q}}{x_{ij}x_{ik}\left\langle {v_{\,^{*}{qj}},v_{\,^{*}{pk}}} \right\rangle}}}}}} & {{Equation}\mspace{14mu} 5} \\{{{\hat{y}}_{intra}\left( x_{i} \right)} = {\sum\limits_{p = 1}^{s}{\sum\limits_{j \in F_{p}}{\sum\limits_{{({k > j})} \in F_{p}}{x_{ij}x_{ik}\left\langle {v_{\,^{*}{pj}},v_{\,^{*}{pk}}} \right\rangle}}}}} & {{Equation}\mspace{14mu} 6}\end{matrix}$

Equation 5 may be rewritten as Equation 7.

$\begin{matrix}\begin{matrix}{{{\hat{y}}_{inter}\left( x_{i} \right)} = {\sum\limits_{p = 1}^{s}{\sum\limits_{q > p}^{s}{\sum\limits_{j \in F_{p}}{\sum\limits_{k \in F_{q}}{x_{ij}x_{ik}\left\langle {v_{\,^{*}{qj}},v_{\,^{*}{pk}}} \right\rangle}}}}}} \\{= {\sum\limits_{f = 1}^{r}{\sum\limits_{p = 1}^{s}{\sum\limits_{q > p}^{s}{\sum\limits_{j \in F_{p}}{\sum\limits_{k \in F_{q}}{x_{ij}x_{ik}v_{fqj}v_{fpk}}}}}}}} \\{= {\sum\limits_{f = 1}^{r}{\sum\limits_{p = 1}^{s}{\sum\limits_{q > p}^{s}{\left( {\sum\limits_{j \in F_{p}}{x_{ij}v_{fqj}}} \right)\left( {\sum\limits_{j \in F_{q}}{x_{ij}v_{fpj}}} \right)}}}}}\end{matrix} & {{Equation}\mspace{14mu} 7}\end{matrix}$

Equation 6 may be rewritten as Equation 8.

$\begin{matrix}\begin{matrix}{{{\hat{y}}_{intra}\left( x_{i} \right)} = {\sum\limits_{p = 1}^{s}{\sum\limits_{j \in F_{p}}{\sum\limits_{{({k > j})} \in F_{p}}{x_{ij}x_{ik}\left\langle {v_{\,^{*}{pj}},v_{{\,^{*}p}\; k}} \right\rangle}}}}} \\{= {\sum\limits_{f = 1}^{r}{\sum\limits_{p = 1}^{s}{\sum\limits_{j \in F_{p}}{\sum\limits_{{({k > j})} \in F_{p}}{x_{ij}x_{ik}v_{fpj}v_{fpk}}}}}}} \\{= {\frac{1}{2}{\sum\limits_{f = 1}^{r}\left\lbrack {{\sum\limits_{p = 1}^{s}{\sum\limits_{j \in F_{p}}{\sum\limits_{k \in F_{p}}{x_{ij}x_{ik}v_{fpj}v_{fpk}}}}} -} \right.}}} \\\left. {\sum\limits_{p = 1}^{s}{\sum\limits_{j \in F_{p}}{x_{ij}^{2}v_{fpj}^{2}}}} \right\rbrack \\{= {\frac{1}{2}{\sum\limits_{f = 1}^{r}\left\lbrack {{\sum\limits_{p = 1}^{s}{\left( {\sum\limits_{j \in F_{p}}{x_{ij}v_{fpj}}} \right)\left( {\sum\limits_{k \in F_{p}}{x_{ik}v_{fpk}}} \right)}} -} \right.}}} \\\left. {\sum\limits_{j = 1}^{m}{x_{ij}^{2}v_{f\;\alpha\;{(j)}j}^{2}}} \right\rbrack \\{= {{\frac{1}{2}{\sum\limits_{f = 1}^{r}{\sum\limits_{p = 1}^{s}\left( {\sum\limits_{j \in F_{p}}{x_{ij}v_{fpj}}} \right)^{2}}}} -}} \\{\frac{1}{2}{\sum\limits_{f = 1}^{r}{\sum\limits_{j = 1}^{m}{x_{ij}^{2}v_{f\;{\alpha{(j)}}j}^{2}}}}}\end{matrix} & {{Equation}\mspace{14mu} 8}\end{matrix}$

Equations 3, 4, 5, 6, 7, and 8 may be recombined to generate Equation 9, where Equation 10 and Equation 11 apply.

$\begin{matrix}{{\hat{y}\left( x_{i} \right)} = {{{\hat{y}}_{bias}\left( x_{i} \right)} + {\sum\limits_{f = 1}^{r}{\sum\limits_{p = 1}^{s}{\sum\limits_{q \geq p}^{s}{c_{pq}a_{pqf}a_{qpf}}}}} - {\frac{1}{2}{\sum\limits_{f = 1}^{r}{\sum\limits_{j = 1}^{m}{x_{ij}^{2}v_{f\;{\alpha{(j)}}j}^{2}}}}}}} & {{Equation}\mspace{14mu} 9} \\{\mspace{79mu}{a_{pqf} \equiv {\sum\limits_{j \in F_{p}}{x_{ij}v_{fqj}}}}} & {{Equation}\mspace{14mu} 10} \\{\mspace{79mu}{c_{pq} \equiv \left\{ \begin{matrix}\frac{1}{2} & {{{if}\mspace{14mu} p} = q} \\1 & {otherwise}\end{matrix} \right.}} & {{Equation}\mspace{14mu} 11}\end{matrix}$

In some cases, the a_(pqf) terms may be precomputed and arranged into a tensor of dimension s×s×r. To assess the time complexity of populating this tensor, consider each slice as defined in Equation 12.

$\begin{matrix}{A_{p} \equiv \begin{bmatrix}a_{p\; 11} & a_{p\; 12} & \ldots & a_{p\; 1r} \\\vdots & \vdots & \ddots & \vdots \\a_{p\; s\; 1} & a_{p\; s\; 2} & \ldots & a_{psr}\end{bmatrix}} & {{Equation}\mspace{14mu} 12}\end{matrix}$

Note that each slice A_(p) has dimension s×r, and computing each of its terms a_(pqf) takes O(m_(p)) time because, by Equation 10, each term is a sum over the m_(p) features in field p. This results in a time complexity of O(m_(p)rs) for computing each slice A_(p). Thus, to populate the whole tensor, the time complexity is set forth in Equation 13.

$\begin{matrix}{{\sum\limits_{p = 1}^{s}{O\left( {m_{p}{rs}} \right)}} = {O({mrs})}} & {{Equation}\mspace{14mu} 13}\end{matrix}$

In addition, each of these s×s×r cells is consumed by Equation 9, so the time complexity of using this tensor is bounded by O(rs²). Ignoring the bias term and the subtracted part at the end of Equation 9, which are respectively computed in O(m) and O(mr) time, this results in a model prediction time complexity of O(mrs+rs²). As each feature has exactly one field, and every field is nonempty, it follows that s≤m, which simplifies the prediction time complexity to O(mrs).
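The relationship between the pairwise form of Equation 1 and the precomputed-tensor form of Equations 9, 10, and 11 may be checked numerically with the following sketch. It assumes a dense NumPy tensor V of shape (r, s, m), a 0-based integer array alpha for the field mapping, and randomly generated toy parameters; the function names are hypothetical and the code is not presented as the implementation of any embodiment.

```python
import numpy as np

def ffm_predict_pairwise(x, w0, w, V, alpha):
    """Equation 1 evaluated directly: one term per feature pair, O(m^2 r) time."""
    m = x.shape[0]
    y = w0 + w @ x
    for j in range(m):
        for k in range(j + 1, m):
            y += x[j] * x[k] * (V[:, alpha[k], j] @ V[:, alpha[j], k])
    return y

def ffm_predict_linear(x, w0, w, V, alpha):
    """Equation 9 evaluated with the precomputed a_{pqf} tensor, O(mrs) time."""
    r, s, m = V.shape
    # a[p, q, f] = sum over features j in field p of x_j * v_{fqj} (Equation 10).
    a = np.zeros((s, s, r))
    for j in range(m):
        a[alpha[j]] += x[j] * V[:, :, j].T   # V[:, :, j].T is indexed [q, f]
    y = w0 + w @ x
    for p in range(s):
        for q in range(p, s):
            c = 0.5 if p == q else 1.0       # c_{pq} of Equation 11
            y += c * (a[p, q] @ a[q, p])
    # Subtracted part of Equation 9: own[f, j] = v_{f, alpha(j), j}.
    own = V[:, alpha, np.arange(m)]
    y -= 0.5 * np.sum((x ** 2) * (own ** 2))
    return y

# Hypothetical toy model: m = 6 features, s = 3 fields, r = 2 latent factors.
rng = np.random.default_rng(0)
m, s, r = 6, 3, 2
alpha = np.array([0, 0, 1, 1, 2, 2])
x = rng.normal(size=m)
w0 = 0.5
w = rng.normal(size=m)
V = rng.normal(size=(r, s, m))

y_pairwise = ffm_predict_pairwise(x, w0, w, V, alpha)
y_linear = ffm_predict_linear(x, w0, w, V, alpha)
```

Under these assumptions the two functions agree up to floating-point rounding, while the second touches each feature once when filling the tensor, which is the source of the O(mrs) bound.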

A function ƒ of m variables θ₁, . . . , θ_(m)∈R can be decomposed into a set of functions g, h_(ij) for 1≤i≤d and 1≤j≤m where Equation 14 applies, such that the collection of new variables ϕ_(i)∈R for 1≤i≤d satisfies a system of d additively separable equations on each variable θ_(j) for 1≤j≤m, where Equation 15 applies.

$\begin{matrix}{{f\left( {\theta_{1},\ldots\mspace{14mu},\theta_{m}} \right)} = {g\left( {\phi_{1},\ldots\mspace{14mu},\phi_{d}} \right)}} & {{Equation}\mspace{14mu} 14} \\\begin{matrix}{\phi_{1} = {\sum\limits_{j = 1}^{m}{h_{1j}\left( \theta_{j} \right)}}} \\\vdots \\{\phi_{d} = {\overset{m}{\sum\limits_{j = 1}}{h_{dj}\left( \theta_{j} \right)}}}\end{matrix} & {{Equation}\mspace{14mu} 15}\end{matrix}$

The resulting collection of functions is the d-separable decomposition of ƒ. Furthermore, d is the degree of separability.

This manner of decomposition is motivated by the inherent parallelizable structure of computation over the parameters. Each of ϕ₁, . . . , ϕ_(d) can be computed in parallel from each parameter θ₁, . . . , θ_(m), which is explicit in the vectorized form of the equation in Equation 15. Thus, Equation 16 and Equation 17 apply.

$\begin{matrix}{\phi = {\sum\limits_{j = 1}^{m}{h_{j}\left( \theta_{j} \right)}}} & {{Equation}\mspace{14mu} 16} \\{{h_{j}\left( \theta_{j} \right)} = \begin{pmatrix}{h_{1j}\left( \theta_{j} \right)} \\\vdots \\{h_{dj}\left( \theta_{j} \right)}\end{pmatrix}} & {{Equation}\mspace{14mu} 17}\end{matrix}$

In Equations 15, 16, and 17, θ=(θ₁, . . . , θ_(m)) and ϕ=(ϕ₁, . . . , ϕ_(d)). It should be noted that ϕ may be computed in parallel by computing each h_(j)(θ_(j)) in parallel over 1≤j≤m and summing over their results.

Vectorized notation may be used. Aspects relate to decompositions that minimize d for use in parallel algorithms. Minimizing d minimizes memory overhead within a single machine (e.g., computation server 120.k) and communication overhead within distributed environments. Observe that a trivial decomposition exists for all functions by letting g=ƒ and choosing each h_(j) such that ϕ_(j)=θ_(j) for 1≤j≤m, which results in d=m. Thus, it is always the case that there exists d such that d≤m.
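A d-separable decomposition may be illustrated on a simple function that is unrelated to the FFM. The sketch below assumes the hypothetical function ƒ(θ)=2Σ_(j<k)θ_(j)θ_(k), which admits a decomposition of degree d=2 with h_(j)(θ_(j))=(θ_(j), θ_(j)²) and g(ϕ)=ϕ₁²−ϕ₂. Partitions of θ stand in for separate machines, and the partial sums are computed sequentially here.

```python
import numpy as np

def f(theta):
    """f(theta) = 2 * sum_{j<k} theta_j * theta_k, evaluated pair by pair."""
    m = len(theta)
    return 2.0 * sum(theta[j] * theta[k] for j in range(m) for k in range(j + 1, m))

def h(theta_j):
    """Per-parameter contribution h_j(theta_j) to the d = 2 vector phi."""
    return np.array([theta_j, theta_j ** 2])

def g(phi):
    """Outer function: (sum of theta)^2 minus sum of theta^2."""
    return phi[0] ** 2 - phi[1]

def phi_from_partitions(theta, num_workers):
    """Sum h_j over each partition, then sum the partial results (Equation 16)."""
    partial_sums = [
        sum((h(t) for t in chunk), np.zeros(2))
        for chunk in np.array_split(theta, num_workers)
    ]
    return sum(partial_sums, np.zeros(2))

theta = np.array([0.5, -1.0, 2.0, 3.0, -0.25])   # hypothetical parameters, m = 5
phi = phi_from_partitions(theta, num_workers=3)
```

Only the two-component partial sums need to be exchanged between workers, regardless of m, which is the communication saving that a small d is intended to provide.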

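When the feature vector x and global bias w₀ are held constant, the FFM is a function of mrs+m variables, which are the parameters of the model, as shown in Equation 18.

$\begin{matrix}{{f\left( {w,V} \right)} = {\hat{y}\left( {{x;w_{0}},w,V} \right)}} & {{Equation}\mspace{14mu} 18}\end{matrix}$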

It can be shown that ƒ is d-separable where d=rs²+2, where r is the number of latent factors and s is the number of fields. To simplify the decomposition, consider a decomposition of h_(j) into the sum of two simpler functions: h_(j) ^((w)), which is a function of the feature biases w, and h_(j) ^((V)), which is a function of the latent vectors V. Both h_(j) ^((w)) and h_(j) ^((V)) have range R^(d), and Equation 19 applies.

$\begin{matrix}{{f\left( {w,V} \right)} = {g\left( {{\sum\limits_{j = 1}^{m}\;{h_{j}^{(w)}\left( w_{j} \right)}} + {\sum\limits_{j = 1}^{m}{h_{j}^{(V)}\left( V_{j} \right)}}} \right)}} & {{Equation}\mspace{14mu} 19}\end{matrix}$

A machine can derive g, h_(j) ^((w)), and h_(j) ^((V)) for each 1≤j≤m from the model equation of the FFM. The FFM model equation is set forth as Equation 20, where Equations 21-24 apply.

$\begin{matrix}{{\hat{y}\left( {{x;w_{0}},w,V} \right)} = {w_{0} + \phi^{(1)} + {\sum\limits_{f = 1}^{r}{\sum\limits_{p = 1}^{s}{\sum\limits_{q \geq p}^{s}{c_{qp}\phi_{fqp}^{(2)}\phi_{fpq}^{(2)}}}}} - {\frac{1}{2}\phi^{(3)}}}} & {{Equation}\mspace{14mu} 20} \\{\mspace{79mu}{c_{qp} = \left\{ \begin{matrix}\frac{1}{2} & {{{if}\mspace{14mu} p} = q} \\1 & {otherwise}\end{matrix} \right.}} & {{Equation}\mspace{14mu} 21} \\{\mspace{79mu}{\phi^{(1)} = {\sum\limits_{j = 1}^{m}{w_{j}x_{j}}}}} & {{Equation}\mspace{14mu} 22} \\{\mspace{79mu}{\phi^{(2)} = {\sum\limits_{j = 1}^{m}{x_{j}{\pi_{\alpha{(j)}}\left( V_{j} \right)}}}}} & {{Equation}\mspace{14mu} 23} \\{\mspace{79mu}{\phi^{(3)} = {\sum\limits_{j = 1}^{m}{x_{j}^{2}{\sum\limits_{f = 1}^{r}v_{f\;{\alpha{(j)}}j}^{2}}}}}} & {{Equation}\mspace{14mu} 24}\end{matrix}$

It should be noted that π_(p) is the projection of V_(j) into index p of dimension 3 of a 3-dimensional tensor. The machine begins by constructing each h_(j) ^((w)) and h_(j) ^((V)), each of which produces a d-dimensional vector. Let Φ=R×R^(rs²)×R be the space of ϕ=(ϕ⁽¹⁾, ϕ⁽²⁾, ϕ⁽³⁾) objects. Φ is a vector space which is the range of h_(j) ^((w)) and h_(j) ^((V)). These functions compute the quantities shown in Equations 25 and 26.

$\begin{matrix}{{h_{j}^{(w)}\left( w_{j} \right)} = \begin{pmatrix}{w_{j}x_{j}} \\0 \\0\end{pmatrix}} & {{Equation}\mspace{14mu} 25} \\{{h_{j}^{(V)}\left( V_{j} \right)} = \begin{pmatrix}0 \\{x_{j}{\pi_{\alpha{(j)}}\left( V_{j} \right)}} \\{x_{j}^{2}{\sum\limits_{f = 1}^{r}v_{f\;{\alpha{(j)}}j}^{2}}}\end{pmatrix}} & {{Equation}\mspace{14mu} 26}\end{matrix}$

The sum of h_(j) ^((w)) and h_(j) ^((V)) over all 1≤j≤m results in the desired ϕ vector of Equation 26′ that satisfies Equations 19, 20, and 21.

$\begin{matrix}{\phi = {\begin{pmatrix}\phi^{(1)} \\\phi^{(2)} \\\phi^{(3)}\end{pmatrix} = {\begin{pmatrix}{\sum\limits_{j = 1}^{m}{w_{j}x_{j}}} \\{\sum\limits_{j = 1}^{m}{x_{j}{\pi_{\alpha{(j)}}\left( V_{j} \right)}}} \\{\sum\limits_{j = 1}^{m}{x_{j}^{2}{\sum\limits_{f = 1}^{r}v_{f\;{\alpha{(j)}}j}^{2}}}}\end{pmatrix} = {{\sum\limits_{j = 1}^{m}{h_{j}^{(w)}\left( w_{j} \right)}} + {\sum\limits_{j = 1}^{m}{h_{j}^{(V)}\left( V_{j} \right)}}}}}} & {{Equation}\mspace{14mu} 26^{\prime}}\end{matrix}$

In Equation 27, g is constructed as a remaining functional form of ŷ as a function of ϕ.

$\begin{matrix}{{g(\phi)} = {w_{0} + \phi^{(1)} + {\sum\limits_{f = 1}^{r}{\sum\limits_{p = 1}^{s}{\overset{s}{\sum\limits_{q \geq p}}{c_{qp}\phi_{fqp}^{(2)}\phi_{fpq}^{(2)}}}}} - {\frac{1}{2}\phi^{(3)}}}} & {{Equation}\mspace{14mu} 27}\end{matrix}$

It can be observed that the resulting Equation 28 is equivalent to ŷ.

$\begin{matrix}{{g\left( {{\sum\limits_{j = 1}^{m}{h_{j}^{(w)}\left( w_{j} \right)}} + {\sum\limits_{j = 1}^{m}{h_{j}^{(V)}\left( V_{j} \right)}}} \right)} = {\hat{y}\left( {{x;w_{0}},w,V} \right)}} & {{Equation}\mspace{14mu} 28}\end{matrix}$

Hence, ƒ is d-separable where d=rs²+2. This result is interesting because d<n, where n=mrs+m is the number of model parameters.
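The decomposition of Equations 20 through 28 may be checked numerically with the sketch below. It assumes a dense NumPy tensor V of shape (r, s, m), a 0-based field mapping alpha, and a flat layout of ϕ as (ϕ⁽¹⁾, ϕ⁽²⁾ flattened, ϕ⁽³⁾); these layout choices, the function names, and the random toy parameters are assumptions made for illustration.

```python
import numpy as np

def ffm_pairwise(x, w0, w, V, alpha):
    """Reference value of y-hat from the pairwise model equation (Equation 1)."""
    m = x.shape[0]
    y = w0 + w @ x
    for j in range(m):
        for k in range(j + 1, m):
            y += x[j] * x[k] * (V[:, alpha[k], j] @ V[:, alpha[j], k])
    return y

def h_j(j, x, w, V, alpha):
    """h_j^(w)(w_j) + h_j^(V)(V_j) as one flat vector of length r*s*s + 2."""
    r, s, _ = V.shape
    phi2 = np.zeros((r, s, s))
    phi2[:, :, alpha[j]] = x[j] * V[:, :, j]          # x_j * pi_{alpha(j)}(V_j)
    phi3 = x[j] ** 2 * np.sum(V[:, alpha[j], j] ** 2)
    return np.concatenate(([w[j] * x[j]], phi2.ravel(), [phi3]))

def g(phi, w0, r, s):
    """Equation 27: rebuild y-hat from the aggregated vector phi."""
    phi1, phi3 = phi[0], phi[-1]
    phi2 = phi[1:-1].reshape(r, s, s)                 # indexed [f, q, p]
    y = w0 + phi1 - 0.5 * phi3
    for p in range(s):
        for q in range(p, s):
            c = 0.5 if p == q else 1.0
            y += c * (phi2[:, q, p] @ phi2[:, p, q])
    return y

# Hypothetical toy model: m = 5 features, s = 2 fields, r = 3 latent factors.
rng = np.random.default_rng(0)
m, s, r = 5, 2, 3
alpha = np.array([0, 0, 1, 1, 1])
x, w, w0 = rng.normal(size=m), rng.normal(size=m), 0.25
V = rng.normal(size=(r, s, m))

# Each h_j depends only on the parameters of feature j, so the terms of this
# sum could be computed on different machines and then added together.
phi = sum(h_j(j, x, w, V, alpha) for j in range(m))
y_from_phi = g(phi, w0, r, s)
y_reference = ffm_pairwise(x, w0, w, V, alpha)
```

In this toy setting ϕ has rs²+2=14 components, and g(ϕ) matches the pairwise evaluation up to floating-point rounding.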

In addition to distributing training data, the parameters of large models may be distributed to leverage parallelism across a cluster. For example, recommendation systems at professional networking services may include billions of parameters. Some aspects relate to an approach to distributed training of d-separable decompositions of models on sparse data using the Bulk Synchronous Parallel (BSP) model of computation. This approach leverages data sparsity and the degree of separability to reduce the communication overhead in a distributed procedure.

Training may be formulated as an optimization problem of the form set forth in Equation 29.

$\begin{matrix}{\hat{\theta} = {\underset{\theta}{\arg\;\min}{\sum\limits_{i = 1}^{n}{l\left( {{\hat{y}\left( {x_{i};\theta} \right)},y_{i}} \right)}}}} & {{Equation}\mspace{14mu} 29}\end{matrix}$

Equation 29 uses a labeled example set D={(x_(i), y_(i))}_(i=1) ^(n) of feature vectors x_(i)∈R^(q) and labels y_(i)∈R; a differentiable loss function l(ŷ_(i), y_(i)) over predictions ŷ_(i) and labels y_(i); and a model equation ŷ(x; θ) over feature vectors x and parameterized by θ∈R^(m). Some aspects compute the parameters that minimize the sum of the loss values after applying the model equation to each feature vector in the labeled example set. In a distributed environment, this optimization problem is difficult to solve efficiently due to the significant number of complicating variables θ. To parallelize the computation, some aspects make the objective function additively separable in the variables.

To achieve additive separability, some aspects reformulate the optimization problem by fixing constants and using d-separable decompositions of the model equation. Since the labeled example set is fixed during training, some aspects assume that the labeled example set can be distributed so that some aspects can fix a loss application and model application to each labeled example, leading to 2n new functions, represented as Equation 30.

$\begin{matrix}{{{{\hat{y}}_{i}(\theta)} = {\hat{y}\left( {x_{i};\theta} \right)}},\qquad{{l_{i}\left( {\hat{y}}_{i} \right)} = {l\left( {{{\hat{y}}_{i}(\theta)},y_{i}} \right)}}} & {{Equation}\mspace{14mu} 30}\end{matrix}$

These new functions allow reformulation of the optimization problem strictly in terms of the parameters, as shown in Equation 31, which is subject to Equation 32. Thus, d-separable decompositions enable reformulation of the objective function to be additively separable.

$\begin{matrix}{\hat{\theta} = {\underset{\theta}{\arg\;\min}{\sum\limits_{i = 1}^{n}{l_{i}\left( {{\hat{y}}_{i}(\theta)} \right)}}}} & {{Equation}\mspace{14mu} 31} \\{{\phi_{i} = {\sum\limits_{j = 1}^{m}{h_{ij}\left( \theta_{j} \right)}}},\mspace{11mu}{i = 1},\ldots\mspace{14mu},{n.}} & {{Equation}\mspace{14mu} 32}\end{matrix}$

From Equation 31, some aspects are interested in optimizing θ while keeping each ϕ₁, . . . , ϕ_(n) feasible. Some aspects solve this problem by parallelizing updates in a descent method. This method could have been applied to the original problem; however, the introduction of the ϕ₁, . . . , ϕ_(n) variables exposes a two-stage, parallel dynamic programming procedure for each update iteration, as shown in Equation 33, where ƒ₀ is the objective function of Equation 34.

$\begin{matrix}\left. \theta\leftarrow{\theta - {\eta{\nabla_{\theta}{f_{0}\left( {\phi_{1},\ldots\mspace{14mu},\phi_{n}} \right)}}}} \right. & {{Equation}\mspace{14mu} 33} \\{{f_{0}\left( {\phi_{1},\ldots\mspace{14mu},\phi_{n}} \right)} = {\sum\limits_{i = 1}^{n}{l_{i}\left( {g_{i}\left( \phi_{i} \right)} \right)}}} & {{Equation}\mspace{14mu} 34}\end{matrix}$

It should be observed that the gradient of the objective function with respect to the original parameters ∇_(θ)ƒ₀(ϕ₁, . . . , ϕ_(n)) may be computed in parallel if the ϕ₁, . . . , ϕ_(n) variables were precomputed. So, a single iteration of gradient descent may use two super-operations in a distributed algorithm. First, compute ϕ₁, . . . , ϕ_(n) in parallel given θ₁, . . . , θ_(m). Second, update θ₁, . . . , θ_(m) in parallel after computing ∇_(θ)ƒ₀(ϕ₁, . . . , ϕ_(n)). The full algorithm is outlined in Algorithm 1.

Algorithm 1: Separable Gradient Descent
Data: Parameters θ, feature matrix X, and a label vector y
Results: Optimized parameters w₀ and Θ
while not converged do
    # Inner pass
    for j ← 1 to m do in parallel
        for i ∈ I_(j) do
            h_(ij) ← h_(ij)(θ_(j))
        end
    end
    # Outer pass
    for i ← 1 to n do in parallel
        ϕ_(i) ← Σ_(j∈J_(i)) h_(ij)
        for j ∈ J_(i) do
            $\left. \frac{\partial l_{i}}{\partial\theta_{j}}\leftarrow{\frac{\partial l_{i}}{\partial g_{i}}\frac{\partial g_{i}}{\partial\theta_{j}}} \right.$
        end
    end
    # Update pass
    for j ← 1 to m do in parallel
        $\left. \omega_{j}\leftarrow{\sum\limits_{i \in I_{j}}\frac{\partial l_{i}}{\partial\theta_{j}}} \right.$
        θ_(j) ← θ_(j) − ηω_(j)
    end
end
return θ
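The three passes of Algorithm 1 can be illustrated with a minimal, single-process sketch. It is not the distributed implementation: the loops marked "in parallel" run sequentially here, and the sketch assumes the simplest separable model, h_(ij)(θ_(j))=θ_(j)x_(ij) with g_(i) the identity and a squared loss. The function name, the step size, and the iteration count are illustrative assumptions.

```python
import numpy as np

def separable_gradient_descent(X, y, eta=0.1, iters=300):
    """Sequential simulation of the inner, outer, and update passes.

    Assumptions (not from the source): a linear model with
    h_ij(theta_j) = theta_j * x_ij, g_i(phi) = phi, and loss 0.5 * (g - y)^2.
    """
    n, m = X.shape
    theta = np.zeros(m)
    # I[j]: examples where feature j is non-zero; J[i]: non-zero features of example i.
    I = [np.flatnonzero(X[:, j]).tolist() for j in range(m)]
    J = [np.flatnonzero(X[i, :]).tolist() for i in range(n)]
    for _ in range(iters):
        # Inner pass: each feature j evaluates h_ij for the examples it touches.
        h = {}
        for j in range(m):
            for i in I[j]:
                h[(i, j)] = theta[j] * X[i, j]
        # Outer pass: each example i sums its h_ij into phi_i and applies the chain rule.
        grad = {}
        for i in range(n):
            phi_i = sum(h[(i, j)] for j in J[i])
            dl_dg = phi_i - y[i]              # d l_i / d g_i for the squared loss
            for j in J[i]:
                grad[(i, j)] = dl_dg * X[i, j]  # times d g_i / d theta_j
        # Update pass: each feature j sums its partial derivatives and takes a step.
        for j in range(m):
            omega_j = sum(grad[(i, j)] for i in I[j])
            theta[j] -= eta * omega_j
    return theta

X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = X_demo @ np.array([2.0, -1.0])
theta_demo = separable_gradient_descent(X_demo, y_demo)
```

Only the non-zero entries of X are visited in the inner and outer passes, which is where the sparsity of the data reduces the work and, in a distributed setting, the volume of exchanged values.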

The communication cost of the inner pass is O(nd). The communication cost of the outer pass is O(mn). The total communication cost of a single iteration is O(nd+mn). Observe that a naïve implementation would result in O(nm+mn).

A variety of subsystems may be leveraged to reduce the running time of local computations, the communication overhead, and the storage overhead. Some examples of the design of each subsystem and its responsibilities are discussed below.

Aspects are described with respect to the FFM model framework. However, the problems these techniques solve may apply to any model that has a large parameter space, such as deep learning models, matrix factorization, and others.

In some embodiments, the fragments system describes an optimized set of data structures. An object is considered a fragment if it can be combined pairwise with another object of the same type such that the information from one object is merged into the other object without allocating a new object.

Typically, a fragment is an object that may be used to construct an instance of some tensor. Consider the example of constructing a sparse vector. A collection of n fragments of this sparse vector must also be sparse vectors. Combining n fragments naïvely in pairs as immutable objects may produce O(n) temporary objects. The destructive property of absorption prevents allocation unless explicitly required.
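The absorption property can be sketched as follows. The class name, the `absorb` method, and the dictionary-based storage are hypothetical choices made for illustration; the source does not specify the fragment data structures.

```python
from functools import reduce

class SparseVectorFragment:
    """Hypothetical fragment: a partial sparse vector stored as {index: value}."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def absorb(self, other):
        """Merge `other` into this fragment in place and return this fragment."""
        for index, value in other.entries.items():
            self.entries[index] = self.entries.get(index, 0.0) + value
        return self

fragments = [
    SparseVectorFragment({0: 1.0}),
    SparseVectorFragment({3: 2.0}),
    SparseVectorFragment({0: 0.5, 7: 4.0}),
]
# Fold every fragment into the first one; no intermediate fragment is allocated.
combined = reduce(lambda acc, frag: acc.absorb(frag), fragments)
```

Because `absorb` mutates and returns the accumulator, the fold reuses the first fragment for the whole combination instead of creating one temporary object per pairwise merge.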

In some aspects, there are two distributed data sources, the labeled example set D and the parameters θ∈R^(m). The current format of D as an ordered set is inflexible for distribution. D is decomposed further into a feature matrix X∈R^(n×q) and a label vector y∈R^(n). Some aspects construct row vector i of X from x_(i)∈D for each 1≤i≤n. Each element in the label vector y is constructed from each y_(i)∈D.

It should be noted that there are two reusable schemes: sharding based on the example index h_(n): {1, . . . , n}→{1, . . . , n} and sharding based on the feature index h_(q): {1, . . . , q}→{1, . . . , n}. The values X and y may be sharded by example indices. The value X may also be sharded by feature indices. The value y is sharded by its example index. Some aspects refer to X_(j) as the column vector j of the matrix X that is sharded by feature index. Some aspects refer to x_(i) as the row vector i of the matrix X that is sharded by example index. To distribute θ, some aspects create a sharding scheme by composing h_(q) with a new sharding scheme that maps {1, . . . , m} to {1, . . . , q}.

In the BSP model, X_(j) is guaranteed to be collocated with θ_(j) for each 1≤j≤q and x_(i) is guaranteed to be collocated with y_(i) for each 1≤i≤n. The collocation of data enables performing local operations on the data sharded by the same index.

When the feature matrix X has both dense features and sparse features, the training algorithm is inefficient because the work becomes unevenly distributed across the cluster. Note that a dense feature occurs in every example and a sparse feature typically only occurs in very few examples.

It can be observed that the processors assigned dense features do the most work between the inner pass and outer pass, since they must replicate work for each example in the original data source. To alleviate this issue, some aspects decompose the feature matrix X into dense and sparse subsets X_(d) and X_(s). The sparse subset X_(s) is handled as dictated by the original distributed training algorithm. The dense subset X_(d) is instead broadcast to each machine as a local cache where feature values may be fetched in constant time during the outer pass.
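The decomposition into X_(d) and X_(s) might be sketched as below. The source characterizes a dense feature as one that occurs in every example; the fill-ratio threshold used here is an assumption added so that nearly dense columns can also be routed to the broadcast cache, and the function name is hypothetical.

```python
import numpy as np

def split_dense_sparse(X, density_threshold=0.5):
    """Return (dense_cols, sparse_cols) index arrays based on column fill ratio.

    `density_threshold` is an illustrative assumption, not a value from the source.
    """
    density = np.count_nonzero(X, axis=0) / X.shape[0]
    dense_cols = np.flatnonzero(density >= density_threshold)
    sparse_cols = np.flatnonzero(density < density_threshold)
    return dense_cols, sparse_cols

X = np.array([[1.0, 0.0, 3.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 5.0]])
dense_cols, sparse_cols = split_dense_sparse(X)
X_d = X[:, dense_cols]   # would be broadcast to every machine as a local cache
X_s = X[:, sparse_cols]  # would be sharded by feature index as before
```

In this toy matrix only the first column is present in every row, so it alone lands in X_(d); the remaining columns stay on the sparse path.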

Since the labeled example set D is typically sharded by example index already, sharding X by example index is trivial since each row of X is already on the correct processor. Sharding X by feature index requires a distributed, sparse matrix transpose algorithm, which is performed using the fragments system.

Some techniques related to modeling are disclosed below. Modeling may refer to techniques to train machine learning (ML) models for specialized machine learning tasks or to counteract degenerate circumstances that may occur in practice.

Training a model from a dataset (which may include one or multiple data sources) with a single label is difficult because no negative examples are provided. (For example, it is difficult to predict what kinds of people Company X would hire based on a list of its employees, because this list does not allow the machine to analyze examples of people who were not hired by Company X. Similarly, if a machine is provided with multiple photographs of cats, the machine might not be able to learn to identify whether a photograph includes a cat, because it lacks examples of photographs that lack cats.) This type of data with a single label typically comes from implicit data sets such as click-through data. This is a degenerate circumstance. For example, training a model ŷ(x; Θ) to predict that the label is always 1 is easy: let w₀ be 1 and all other parameters be 0. This degenerate case may be alleviated with a technique described here as negative sampling for data sets with extremely sparse features.

The negative sampling procedure creates a new data set D={(x_(i), y_(i))} where each x_(i) is a vector of sampled features and each y_(i) is the designated negative label, which typically has a value of 0 or −1 for binary classification but can also be assigned a confidence score based on the data set distribution.

The negatively sampled feature matrix may be constructed as follows. Consider the original feature matrix X. Choose some feature index j that is strongly dependent on the other features. Shuffle column X_(j). The resulting matrix is the negatively sampled feature matrix. With an extremely sparse feature matrix, this results in a data source of improbable feature combinations which are typically negative cases. For example, consider a data set of company co-occurrence on member profiles. Suppose that it is desirable to predict which companies may co-occur in a pairwise fashion with another company. The positive examples may be directly extracted from member profiles. Negative examples cannot be directly extracted from member profiles. However, if the existing positive example data set is taken and one of the two companies in each example pair is randomized, then improbable combinations that may act as a surrogate for negative examples are obtained. This strategy is feasible due to the sparsity of the data set: any given company is statistically unlikely to co-occur with any random company.

Some aspects disclose details on optimizing the negative sampling algorithm at scale in a distributed system. Observe that the algorithm requires shuffling a feature column. For data sets that are sharded by example, this is an expensive operation since it requires data movement between multiple machines. Some aspects avoid all data movement between machines (e.g., computation servers 120) by performing local, in-machine shuffles of the partition-local feature column data. This approximation is sufficient for randomness while significantly reducing the running time of the overall algorithm and completely eliminating network overhead.
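A minimal sketch of the partition-local shuffle follows, with each array in `partitions` standing in for the rows held by one machine. The function name, the use of 0 as the negative label, and the seeded generator are assumptions made for illustration.

```python
import numpy as np

def negative_samples_local(partitions, j, seed=0):
    """Shuffle feature column j independently inside each example partition.

    Each partition is treated as the data local to one machine, so no rows
    move between partitions. Returns a list of (X_neg, y_neg) pairs.
    """
    rng = np.random.default_rng(seed)
    negatives = []
    for X_part in partitions:
        X_neg = X_part.copy()
        # Permute only column j; every other column keeps its row alignment.
        X_neg[:, j] = rng.permutation(X_part[:, j])
        y_neg = np.zeros(X_part.shape[0])  # designated negative label (assumed 0)
        negatives.append((X_neg, y_neg))
    return negatives

partitions = [np.arange(12.0).reshape(4, 3), np.arange(12.0, 18.0).reshape(2, 3)]
negatives = negative_samples_local(partitions, j=1)
```

A shuffle can occasionally leave a row unchanged or recreate a genuine positive; with extremely sparse features such collisions are expected to be rare, which is the premise the text relies on.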

The Continuous Bag of Words (CBOW) and Skipgram neural network architectures are used in the Natural Language Processing (NLP) community. These architectures were designed to generate semantic word embeddings. The FM and, by extension, the FFM are able to approximately imitate these models and generate word vectors with carefully designed feature matrices. Some aspects have successfully implemented a variety of word embedding models using FFMs for applications such as areas of expertise refinement questions and company suggestion refinement questions in job search or hiring computer-implemented products.

To begin with the derivation of the imitation of CBOW and Skipgram, consider a simple example using the CBOW architecture. Suppose there is a sentence: “The cat climbed a tree.” The CBOW architecture may take the context “the cat a tree” as input and attempt to predict the holdout word “climbed.” Each word has two associated vectors depending on whether it is being used as context or holdout. The embedding vector is conventionally chosen to be the context vector. In the presence of multiple context words such as in the example, the vectors are averaged. The intuition is that words that co-occur in the same context are semantically similar.

To imitate the CBOW architecture using FFMs, some aspects create two fields: context and holdout. Given each sentence, for example, “The cat climbed a tree,” some aspects construct several examples using a sliding window while assigning each word a field of context or holdout as appropriate. Hence, each word has four vectors associated with it: two features for the context version and the holdout version, and each feature has two vectors for interacting with each field. Two of the four vectors are unnecessary since some aspects are primarily interested in context-holdout and holdout-context interactions. By restricting the vector interactions, some aspects are able to approximately achieve the CBOW architecture.
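The sliding-window construction of field-tagged examples might look like the following sketch. The tokenization, the window size, and the `(field, word)` tuple encoding are illustrative assumptions; the source does not fix these details.

```python
def cbow_examples(sentence, window=2):
    """Build one example per word: neighbours get the 'context' field,
    the centre word gets the 'holdout' field."""
    tokens = sentence.lower().rstrip(".").split()
    examples = []
    for center, holdout in enumerate(tokens):
        lo = max(0, center - window)
        hi = min(len(tokens), center + window + 1)
        context = [tokens[k] for k in range(lo, hi) if k != center]
        features = [("context", word) for word in context]
        features.append(("holdout", holdout))
        examples.append(features)
    return examples

examples = cbow_examples("The cat climbed a tree.")
# examples[2] pairs the context words the, cat, a, tree with the holdout word climbed.
```

Each `(field, word)` pair would become one column of the FFM feature matrix, so the same word contributes two distinct features, one per field, as described above. Swapping the two field tags yields the corresponding Skipgram-style examples.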

To imitate the Skipgram architecture using FFMs, some aspects observe that the Skipgram architecture can be seen as CBOW with the fields swapped: given a single holdout word, predict a context set of words.

Finally, it is worth discussing why the FFM is an approximation rather than a complete imitation. For the purpose of speed and scale, the FFM typically measures pairwise interactions between features while CBOW and Skipgram measure c-interactions where c is the size of the context window. When the CBOW and Skipgram architectures operate on context windows of size two, the FFM can exactly imitate these architectures. Furthermore, it is worth noting that the higher-order extensions of the FFM also measure c-interactions; however, they are frequently observed to add negligible predictive power at the cost of speed.

Sets of features such as a company history are typically treated as a bag of words. This means that each company is encoded in a feature vector and typically has no notion of order. To model chronology of features, for example, given a set of companies on a member's profile where the most recent companies are determined to be more important, some aspects implement time-weighting. Time-weighting is a simple notion of either linearly or exponentially assigning weights to the set of history items.

For example, suppose some aspects determine that more recent companies on a member profile are more relevant to the prediction task. Some aspects may exponentially weight the companies on a member profile such that the most recent company has the largest weight.
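Both weighting schemes can be sketched in a few lines. The decay factor, the particular linear formula, and the assumption that the history is ordered from oldest to most recent with distinct items are illustrative choices, not values from the source.

```python
def time_weights(history, scheme="exponential", decay=0.5):
    """Assign a weight to each history item; `history` is ordered oldest first.

    Assumes distinct items. `decay` and the linear formula are illustrative.
    """
    n = len(history)
    weights = {}
    for position, item in enumerate(history):
        age = n - 1 - position  # 0 for the most recent item
        if scheme == "exponential":
            weights[item] = decay ** age
        elif scheme == "linear":
            weights[item] = (position + 1) / n
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return weights

exp_weights = time_weights(["Company A", "Company B", "Company C"])
lin_weights = time_weights(["Company A", "Company B", "Company C"], scheme="linear")
```

Under either scheme the most recent item receives weight 1 and older items receive smaller weights; these weights would replace the 0/1 indicator values in the member's feature vector.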

From the standardized member profiles of a professional networking service, some aspects train a model that is able to predict the next company that may employ the member of the professional networking service. The model is trained by observing the work history of the member. Consider a single member. Some aspects remove the member's current company and train the model to predict the held-out company given the list of previous companies where the member has worked. This modeling strategy is an application of the CBOW and Skipgram imitation. Furthermore, the negative sampling modeling technique is applied to generate negative examples. Finally, some aspects apply the time weighting over the companies in the work history to account for chronology. The model may be trained this way over all members in a data repository of a professional networking service. Once the model is trained, some aspects provide the model with a full list of companies from any given member and produce a ranked list of companies that the member may work for next.

Title recommendation proceeds similarly to company recommendation except that some aspects train the model over each member's titles (rather than companies) in his/her work history. Title expansion is very similar to the above title recommendation problem. The main difference is that title recommendation seeks to find a cause-effect relationship between job history and future jobs, whereas title similarity is non-temporal in nature. Thus, some aspects use the same CBOW-inspired approach as above, but instead of holding out a member's most recent title, some aspects hold out a random title. In addition, each title in the job history has equal weight, instead of applying a temporal weighting scheme.

When a user (e.g., a recruiter) provides information about a target role, some aspects ask if a number of areas of expertise are relevant for what the user is seeking. To produce the areas of expertise that are presented to the user, some aspects use an FFM to learn association strength between titles and skills. Each training example is pulled from real profiles and associations are embedded into latent vectors of the FFM parameters. The latent vectors are then grouped based on a pre-existing model of groups defining an “area of expertise” to produce a single area of expertise latent vector. Then, for any given target role, some aspects use the latent vectors to compute the top areas of expertise that are highly associated with the target role but most disassociated from each other.

Some aspects produce a model that uses the target job parameters (e.g., title, company, geographic location, and the like) to score matches with a user's (e.g., a recruiter's) personal network. The model extracts a number of properties, particularly prior work history, from each member in the user's network and scores fit with the target role based on interaction between entities. Some aspects choose a number of top candidates to present to the user as “suggested candidates” and ask if any of the candidates are representative of who the user is trying to hire. This information is used to refine the future recommendations.

FIG. 2 is a diagram 200 illustrating an example of model parameters for predicting feature values in a matrix, in accordance with some embodiments. The data structures of FIG. 2 may be stored at or computed via the control server 110. In some cases, the data structures of FIG. 2 are learned and stored in a distributed fashion (via sharding) during training. The data structures may be stored and used for prediction in a centralized way.

The global bias 210 is a single value w₀. The feature bias 220 corresponds to a feature vector w which includes the values w₁, . . . , w_(m). A training example vector x_(i) is provided with the values x_(i1), . . . , x_(im). A training example label y_(i) is also provided.

The latent factors 230 are shown in a three-dimensional matrix V. As shown, the matrix V has dimensions for factor (ƒ=1, . . . , r), field (p=1, . . . , s), and feature (j=1, . . . , m). Each cell in the matrix V is labeled with a value for ƒ, p, and j, in that order.

FIG. 3 is a data flow diagram 300 for predicting feature values in a matrix, in accordance with some embodiments. As shown, the data flow diagram 300 includes the control server 110 and the computation servers 120.1-3 of FIG. 1.

The control server 110 stores the matrix V, the vectors w and x_(i), and the values w₀ and y_(i). As shown, the matrix V is sharded, that is, divided into three submatrices along the feature dimension, with each of the three submatrices being assigned to one of the three computation servers 120.1-3. The vectors w and x_(i) are also sharded into three subvectors, with each of the three subvectors being assigned to one of the three computation servers 120.1-3. The parameters w₀ and y_(i) are not sharded, as w₀ is a scalar and y_(i) is not used for prediction. All of the sharded parameters are sharded along the feature dimension j. In other words, each of the computation servers 120.k is responsible for a given range of j values. As shown, the computation server 120.1 is responsible for j values 1 and 2; the computation server 120.2 is responsible for j values 3 and 4; and the computation server 120.3 is responsible for j values 5 through m.

Consider what happens at a single computation server 120.k. The computation server 120.k is responsible for a range of j values (features). For simplicity, suppose the range covers only a single value j. This value j may correspond to a single feature, for example, the company “ABC Corporation,” or the title “legal assistant.” The stored data about this feature includes: w_(j) (the feature bias), x_(ij) (the training example value for the feature, which may be greater than zero if the feature is present or zero if the feature is absent), and V_(j) (a two-dimensional matrix corresponding to the slice of three-dimensional matrix V for the feature j). The goal is to compute the shard's values of ϕ⁽¹⁾, ϕ⁽²⁾, and ϕ⁽³⁾. ϕ⁽¹⁾ is computed according to Equation 35. ϕ⁽³⁾ is computed according to Equation 36, where p is the field of j. For example, if j referenced “ABC Corporation,” p would reference “employer.” If j referenced “legal assistant,” p would reference “title.” Labels may be handled in a similar manner to the examples.

$\begin{matrix}{\phi^{(1)} = {w_{j}x_{ij}}} & {{Equation}\mspace{14mu} 35} \\{\phi^{(3)} = {x_{ij}^{2}{\sum\limits_{f = 1}^{r}V_{fpj}^{2}}}} & {{Equation}\mspace{14mu} 36}\end{matrix}$

ϕ⁽²⁾ is more complicated, as it is not just a scalar, but is the s×s×r tensor A referenced above. However, for a particular feature j and its field p, in some embodiments, only the values in slice p of the tensor are computed. Different features may affect different slices of A, so that when they are all computed, A is fully populated.

FIG. 4 illustrates a slice A_(p) of the three-dimensional matrix V. A_(p) for just this j is simply the scalar-matrix product x_(ij)*V_(j) ^(T), so A_(fpj)=x_(ij)*V_(fpj) for all ƒ=1 . . . r and j=1 . . . m. (The superscript ‘T’ indicates a transpose operation.) To finish computing A, the computation server 120.k takes each A_(j) produced above and adds them together. The same occurs with the ϕ⁽¹⁾ and ϕ⁽³⁾ shards. It should be noted that this is a “reduce” step that combines the results from every machine. Once ϕ⁽¹⁾, A, and ϕ⁽³⁾ are present, the prediction can be computed on the computation server 120.k via Equations 7 and 8.
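The per-shard computation and the subsequent reduce step can be sketched as follows. The sketch stops at the reduced values of ϕ⁽¹⁾, A, and ϕ⁽³⁾ and does not reproduce Equations 7 and 8. The array layout is an assumption: V is stored with shape (r, s, m) for factor, field, and feature, and slice p of A is taken along its first axis. The function names and the `field_of` lookup are hypothetical.

```python
import numpy as np

def shard_contribution(w, x_i, V, field_of, features):
    """Partial phi1, A, and phi3 for one example over the features owned by one shard.

    Assumed layout: V has shape (r, s, m) = (factor, field, feature);
    A has shape (s, s, r) and is indexed first by the field p of feature j.
    """
    r, s, _ = V.shape
    phi1, phi3 = 0.0, 0.0
    A = np.zeros((s, s, r))
    for j in features:
        p = field_of[j]
        phi1 += w[j] * x_i[j]                           # Equation 35
        phi3 += x_i[j] ** 2 * np.sum(V[:, p, j] ** 2)   # Equation 36
        A[p] += x_i[j] * V[:, :, j].T                   # slice p: x_ij * V_j^T, shape (s, r)
    return phi1, A, phi3

def reduce_contributions(parts):
    """The 'reduce' step: add the partial results from every shard."""
    phi1 = sum(part[0] for part in parts)
    A = sum(part[1] for part in parts)
    phi3 = sum(part[2] for part in parts)
    return phi1, A, phi3

rng = np.random.default_rng(0)
r, s, m = 2, 3, 5
V = rng.normal(size=(r, s, m))
w = rng.normal(size=m)
x_i = rng.normal(size=m)
field_of = [0, 0, 1, 2, 2]            # hypothetical field of each feature
shards = [[0, 1], [2, 3], [4]]        # feature ranges owned by three servers
parts = [shard_contribution(w, x_i, V, field_of, shard) for shard in shards]
phi1, A, phi3 = reduce_contributions(parts)
```

Because every quantity is a plain sum over features, the reduced result does not depend on how the features are assigned to shards.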

FIG. 5 is a flow chart of a method 500 for training a machine, in accordance with some embodiments. The method 500 is described here as being implemented at the control server 110 of the system 100. However, the method 500 may be implemented at other machine(s) or in other system(s).

At operation 510, the control server 110 accesses a matrix, such as an example matrix with features. The matrix has multiple dimensions. One dimension of the matrix represents features. Another dimension of the matrix represents data points.

At operation 520, the control server 110 separates the matrix into multiple submatrices along a first dimension (e.g., the feature dimension as shown in FIG. 3). Each submatrix includes all cells in the matrix for a set of values in the first dimension.

At operation 530, the control server 110 provides the multiple submatrices to multiple computation servers 120. In some examples, each computation server 120.k is provided with a single submatrix.

At operation 540, the control server 110 causes each computation server 120.k to compute a correlation between values in second dimension(s) (e.g., the factor dimension and/or the field dimension) of the matrix and a value for a preselected feature of the matrix. The correlation is used to predict the values of the preselected feature based on the values along the second dimension(s). The correlation between the second dimension(s) of the matrix and the feature value may be based on a differentiable mathematical function applied to values in the at least one second dimension. In one example, the feature dimension represents employers. One of the second dimension(s) represents individuals. A non-zero value in a cell of the matrix represents that the individual is a current or former employee of the employer. In some cases, another one of the second dimension(s) represents area(s) of expertise, current or former educational institution(s) attended, or degree(s) earned. In some embodiments, computing the correlation includes computing, for at least one additional feature different from the preselected feature, a probability that, for a slice of the matrix along the at least one second dimension, the preselected feature has a non-zero value given that the at least one additional feature has a non-zero value. The correlation may be computed based on the non-zero values in a slice along the at least one second dimension and the computed probability. In some examples, the operation 520 includes separating the matrix into submatrices along the feature dimension. The preselected feature of operation 540 is one of the features along the feature dimension.

At operation 550, the control server 110 provides an output representing the computed correlation (or any other prediction). The output may be stored in a data repository coupled with the control server 110 or displayed at the client device 130. In some cases, the control server 110 receives a new submatrix including values along the second dimension(s). In some implementations, the control server 110 predicts, using the computed correlation, a value for the preselected feature for the new submatrix. In some cases, the output includes a combination of correlations, and the semantics of the output are designed based on the type of prediction task.

FIG. 6 is a schematic diagram of a technique 600 for associating a title with at least one area of expertise (AoE). In some aspects, the technique 600 associates a title with an AoE by trying to balance relevance and variety simultaneously.

At block 610, a plurality of areas of expertise (AoEs 1-5) are accessed by the computation server 120.k. While five areas of expertise are presented here, the technology may be implemented with any number of areas of expertise. In some examples, the computation server 120.k accesses (e.g., via the network 140 or via the control server 110) a data repository storing thousands of areas of expertise. The areas of expertise may correspond to professional areas of expertise, such as “web development,” “back end development,” “front end development,” “patent drafting and prosecution,” or “patent litigation.”

At block 620, a title is accessed by the computation server 120.k. The title may correspond to a professional title, such as “senior software engineer,” “patent attorney,” or “insurance agent.”

At block 630, the areas of expertise (from block 610) and the title (from block 620) are mapped, by the computation server 120.k, to an FM-generated latent space. As shown, the FM-generated latent space is two-dimensional. However, in some embodiments, a latent space with more than two dimensions (e.g., three, four, or five dimensions) may be used. To generate this mapping, latent vectors learned by a factorization machine trained on examples of <Current Title, Skills> tuples are extracted from a data store. In some examples, the data store stores member profiles from a professional networking service. According to some examples, the computation server 120.k extracts training examples from the data store. Each training example includes a <current title, skills> tuple. The computation server 120.k trains a factorization machine. It should be noted that latent vectors learned by the factorization machine may be viewed as a mapping of titles to the latent space and a mapping of skills to the latent space. In other words, for each title, there is a vector that numerically represents the title. For each skill, there is a vector that numerically represents the skill.

At block 640, the areas of expertise are arranged, at the computation server 120.k, by distance from the title in the latent space of block 630. Areas of expertise having a greater distance from the title than a predefined threshold distance (e.g., AoE 3, as shown) are filtered out. The remaining areas of expertise (e.g., AoE 2, AoE 4, AoE 5, and AoE 1) are provided to block 650.

At block 650, the remaining areas of expertise (from block 640) and the title are mapped, by the computation server 120.k, onto the latent space from block 630. Three random points 651 are mapped onto the latent space. Pushing forces 652 are modeled from each random point 651 to each other random point 651. Pulling forces 653 are modeled from each random point 651 to the title. The direction of the forces 652 and 653 is computed by subtracting the latent vectors. For example, to compute the pulling force 653 to the title, the computation server 120.k computes: Latent Vector (Title) − Latent Vector (point). Then the computation server 120.k normalizes the result to obtain a unit vector representing the direction of the force. The magnitude of each of the forces 652 and 653 may be a hyper-parameter. The hyper-parameter could be changed to something that is similar to a magnet (e.g., higher magnitude when close together) or some other scheme. In some cases, a constant force magnitude may be used. In some cases, an early stopping mechanism may be used. In other words, there may be a configurable number of iterations to determine when equilibrium is reached. In one example, five iterations are performed before the simulation of the forces is stopped. After the simulation of the forces is completed, the sampled points are remapped to areas of expertise.

As described in conjunction with block 630, there is a mapping of skill to latent vector from the factorization machine. A latent vector can be constructed to represent any area of expertise by averaging the vectors of its constituent skills. It should be noted that this provides a mapping from an area of expertise to its corresponding vector. Originally, all of the entities of interest (titles and areas of expertise) are mapped to the latent space of block 630. The latent space is used for mathematical operations. However, the result of these operations is a vector in the latent space, which is converted to an entity of interest before being provided as output to a user.
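The force simulation of block 650 and the remapping to areas of expertise can be sketched as below. Constant force magnitudes, the step sizes, the random initialization, and the nearest-neighbor remapping rule are assumptions chosen for illustration; the function names and the toy latent vectors are hypothetical.

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length (zero vectors stay zero)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else np.zeros_like(v)

def aoe_vector(skill_vectors):
    """Latent vector of an area of expertise: the average of its skill vectors."""
    return np.mean(np.stack(skill_vectors), axis=0)

def simulate_forces(title_vec, aoe_vecs, n_points=3, iters=5, pull=0.1, push=0.1, seed=0):
    """Move random points under a pull toward the title and pushes between points,
    then remap each point to its nearest area of expertise."""
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(n_points, title_vec.shape[0]))
    for _ in range(iters):  # fixed iteration count acts as early stopping
        moves = np.zeros_like(points)
        for a in range(n_points):
            # Pulling force: Latent Vector (Title) - Latent Vector (point), normalized.
            moves[a] += pull * unit(title_vec - points[a])
            for b in range(n_points):
                if a != b:
                    # Pushing force: directed away from the other point.
                    moves[a] += push * unit(points[a] - points[b])
        points = points + moves
    names = list(aoe_vecs)
    M = np.stack([aoe_vecs[name] for name in names])
    nearest = [names[int(np.argmin(np.linalg.norm(M - p, axis=1)))] for p in points]
    return points, nearest

title = np.array([0.0, 0.0])
aoes = {
    "AoE 1": np.array([1.0, 0.0]),
    "AoE 2": np.array([0.0, 1.0]),
    "AoE 4": np.array([-1.0, 0.0]),
}
points, nearest = simulate_forces(title, aoes)
```

The pull keeps the sampled points relevant to the title while the pushes spread them apart, which is how the sketch trades relevance against variety. Two points may still remap to the same area of expertise; a fuller version would need a tie-breaking rule.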

At block 660, the three areas of expertise closest to the title after the simulation (e.g., AoE 1, AoE 2, and AoE 5) are presented to a human user via a client device 130 in communication with the computation server 120.k (e.g., via the network 140). The human user specifies which, if any, of the areas of expertise are applicable to the title. The areas of expertise selected by the human user are transmitted to the computation server 120.k.

FIG. 7 is a schematic diagram of a technique 700 for predicting valuesin a matrix.

Block 710 shows a matrix with feature columns (e.g., representing businesses A, B, C, and D) and example rows (e.g., representing individuals Alice, Bob, and Charlie). A “1” (or other non-zero value) in the matrix represents that the individual currently works or has previously worked at the business (e.g., Alice currently works or has previously worked at A). A “0” in the matrix represents that the individual does not work and has never worked at the business (e.g., Bob does not work and has never worked at D). As shown, the y-axis of the matrix in block 710 represents people: Alice, Bob, and Charlie. However, each row is a data point and does not necessarily correspond to a person. In some cases, a single person may correspond to multiple examples. In some cases, multiple people may correspond to a single example.

Block 720 shows the matrix with the names of the individuals and businesses mapped to integers and represented by a legend. At block 730, the features are divided into feature shards (f1 through f4), and the examples are divided into example shards (x1 through x3).

At block 740, the shards are provided to machines (e.g., computation servers 120). Each machine has an inner component for the feature shards, and an outer component for example shards. Blocks 730 and 740 may together sketch out a sharding scheme: partitioning in block 730 and distribution in block 740.

At block 750, the inner component of each machine is separated from the outer component, and parameters are initialized. Initializing the parameters may entail allocating vectors (at least one) per feature in each shard at each machine. Some aspects generally allocate vectors of some dimension between 16 and 1024 and assign random values to each factor, based on a normal distribution.
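Parameter initialization for one shard might be sketched as follows. The function name, the dictionary layout, the standard deviation, and the seed are assumptions; the source specifies only that vectors of some dimension between 16 and 1024 are filled with normally distributed random values.

```python
import numpy as np

def init_shard_parameters(feature_ids, dim=16, scale=0.01, seed=0):
    """Allocate one latent vector per feature in a shard, drawn from N(0, scale^2).

    `scale` and `seed` are illustrative; `dim` should lie between 16 and 1024.
    """
    if not 16 <= dim <= 1024:
        raise ValueError("dim is expected to be between 16 and 1024")
    rng = np.random.default_rng(seed)
    return {j: rng.normal(0.0, scale, size=dim) for j in feature_ids}

# Hypothetical shard holding features 0 and 1.
shard_params = init_shard_parameters([0, 1], dim=16)
```

Each machine would run this only for the features in its own shard, so no initialization data needs to cross the network.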

At block 760, a distributed environment is used. A feature partitionscheme and an example partition scheme are provided, by executors, tothe control server 110. Features are partitioned by the featurepartition scheme. Examples are partitioned by the example partitionscheme. The control server 110 may perform the operations of blocks 720and 730, and may distribute the shards to the machines at block 740.Some aspects minimize the amount of data movement and coordinationbetween each individual machine of the machines to which the shards areprovided at block 740.

At block 770, the machines run an inner pass to generate a first correlation, the first correlation correlating data in an inner component of at least a first machine with data in the outer components of the plurality of machines.

At block 780, the machines run an outer pass to generate a second correlation, the second correlation correlating data in an outer component of at least a second machine with data in the inner components of the plurality of machines. The first correlation and the second correlation may be stored, for example, at the control server 110. An output representing at least the first correlation and the second correlation may be provided. As used herein, the phrases “inner pass” and “outer pass” may refer to making a prediction based on current learned parameters (e.g., factor values) and comparing to the label to compute an error. The error signal is propagated to improve the parameters. Using the inner pass and outer pass, some aspects provide implementation level techniques for scaling machine learning problems.
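The predict, compare, and update cycle described for a pass can be sketched as follows. This is one possible reading only: it assumes the prediction is the dot product of two factor vectors and that the error is propagated by a gradient step on the squared error. The function names, the learning rate, and the starting values are hypothetical, and the sketch runs on a single machine rather than across computation servers.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def training_step(u, v, label, lr=0.1):
    """One predict/compare/update step on a pair of factor vectors."""
    prediction = dot(u, v)
    error = prediction - label
    # The gradient of 0.5 * error**2 with respect to u is error * v,
    # and with respect to v is error * u.
    new_u = [a - lr * error * b for a, b in zip(u, v)]
    new_v = [b - lr * error * a for a, b in zip(u, v)]
    return new_u, new_v, error

# Illustrative starting factors and a label of 1.0 (a non-zero matrix cell).
u, v = [0.1, 0.2], [0.3, -0.1]
errors = []
for _ in range(200):
    u, v, err = training_step(u, v, label=1.0)
    errors.append(abs(err))
```

Repeating the step drives the prediction toward the label, which is the sense in which the error signal improves the parameters.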

FIG. 8 is a flow chart of a method 800 for training a machine and predicting values in a matrix. The method 800 is described here as being implemented within the system 100 of FIG. 1. However, the method 800 may also be implemented using other machines in other systems.

At operation 810, the control server 110 accesses a matrix. The matrix has feature columns and example rows. In some embodiments, the features represent businesses and the examples represent individuals/entities. A non-zero value in the matrix associated with a first individual and a first business indicates that the first individual is or has been employed at the first business. A zero value in the matrix associated with a second individual and a second business indicates that the second individual is not and has not been employed at the second business.

At operation 820, the control server 110 shards the matrix by features and by examples to generate feature shards and example shards, respectively. Each feature shard includes at least one feature column, and each example shard includes at least one example row. In some cases, each of the feature shards has the same number of feature columns, and each of the example shards has the same number of example rows. The number of feature shards may be equal to the number of feature columns in the matrix divided by the number of computation servers 120. The number of example shards may be equal to the number of example rows in the matrix divided by the number of computation servers 120. The operation 820 may correspond to the block 730.
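The sharding of operation 820 can be sketched as below. The helper names `chunk` and `shard_matrix`, the contiguous grouping, and the matrix values are assumptions for illustration; the shard counts (four feature shards, three example shards) follow the f1 through f4 and x1 through x3 example of block 730.

```python
def chunk(indices, num_shards):
    """Split a list of indices into num_shards contiguous, near-equal groups."""
    size, extra = divmod(len(indices), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + size + (1 if i < extra else 0)
        shards.append(indices[start:end])
        start = end
    return shards

def shard_matrix(matrix, num_feature_shards, num_example_shards):
    """Return (feature_shards, example_shards) for a row-major matrix."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    # A feature shard keeps every row but only its own group of columns.
    feature_shards = [
        [[row[c] for c in col_group] for row in matrix]
        for col_group in chunk(cols, num_feature_shards)
    ]
    # An example shard keeps every column but only its own group of rows.
    example_shards = [
        [matrix[r] for r in row_group]
        for row_group in chunk(rows, num_example_shards)
    ]
    return feature_shards, example_shards

# Hypothetical values: rows are Alice, Bob, Charlie; columns are A, B, C, D.
matrix = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]
feature_shards, example_shards = shard_matrix(matrix, 4, 3)
```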

At operation 830, the control server 110 distributes the feature shards and the example shards among the plurality of computation servers 120. Each computation server 120.k includes an inner component storing at least one feature shard and an outer component storing at least one example shard. The operation 830 may correspond to the block 740.

At operation 840, the computation servers 120 run an inner pass to generate a first correlation. The first correlation correlates data in an inner component of at least a first computation server 120.1 with data in the outer components of the plurality of computation servers 120. The operation 840 may correspond to the block 770.

At operation 850, the computation servers 120 run an outer pass to generate a second correlation. The second correlation correlates data in an outer component of at least a second computation server 120.2 with data in the inner components of the plurality of computation servers 120. After the inner pass and outer pass are completed, the first correlation and the second correlation may be stored at the control server 110. The control server 110 may provide an output associated with at least the first correlation and the second correlation. The operation 850 may correspond to the block 780. In some implementations, the operations 840 and 850 are done multiple times in the training phase. In some cases, the operations 810-840 may correspond to the operations of FIG. 5, with parts of the operation 850 including an additional step of computing error and learning.

In some implementations, the control server 110 receives a new example row for the matrix. To generate a recommendation, the control server 110 predicts, based on the first correlation or the second correlation, that at least one zero value in the new example row should be non-zero. In some cases, the zero value(s) that should be non-zero are associated with a specific feature. The recommendation output provided by the control server 110 may include an indication that an individual associated with the new example row should work at a business associated with the specific feature. In other words, the individual may be a good fit for an employment position at the business.

FIG. 9 is a flow chart of a method 900 for ranking job candidates, for example, to generate a recommendation of a job candidate for a specified employment position. As described here, the method 900 may be implemented at the control server 110 of the system 100. Alternatively, the method 900 may be implemented at other machine(s) or in other system(s).

At operation 910, the control server 110 receives, from the client device 130, a request for job candidates for an employment position. The request includes criteria. For example, a request may specify a software engineer in the San Francisco Bay Area with at least a Master's Degree and at least five years of experience. The client device 130 may be operated by a recruiter or headhunter.

At operation 920, the control server 110 generates, based on the request, a set of job candidates for the employment position. For example, the control server 110 may access a data repository (e.g., a professional networking service or an applicant tracking system) and obtain job candidates by filtering job candidates that meet the criteria from the data repository. In some cases, the set of job candidates is generated based on the criteria in the request and based on additional criteria. The additional criteria are determined based on stored information associated with a user of the client device 130. The stored information may include, for example, a company for which the user recruits and whether the user is an employee of that company or an employee, contractor, or worker of a staffing agency. The stored information may include whether the user is a recruiter or a non-recruiting professional, an industry associated with the user, and a current employer of the user.

At operation 930, the control server 110 provides, to the client device 130, a prompt for ranking the set of job candidates. At operation 940, the control server 110 receives, from the client device 130, a response to the prompt. At operation 950, the control server 110 ranks the set of job candidates based on the received response. The operations 930 and 940 relate to refinement. The user of the client device 130 is asked some intelligent questions, which may be based on the FFM model. These questions may be generated in conjunction with the operation 920. The ranking of operation 950 is based on multiple factors, including the answers to the refinement questions from operations 930 and 940.

In some implementations, the prompt includes a request to identify an account of an individual meeting the search criteria for the employment position. The ranking of the set of job candidates is based on one or more attributes of the identified account. The attributes may include one or more of skills, titles, industries, current and past employers, current and past educational institutions, degrees obtained, areas of study, job function, years of experience, and businesses interacted with in a professional networking service.

In some implementations, the prompt includes a request for a user of the client device 130 to select a prior employer from a set of prior employers. For example, the user may be asked if he/she prefers a job candidate who previously worked at ABC Corporation, DEF Corporation, or GHI Corporation. The ranking of the set of job candidates is based on the prior employer selected by the user of the client device 130. For example, if ABC Corporation is selected, then job candidates who have worked at ABC Corporation or similar companies are ranked higher than job candidates who have worked at DEF Corporation (and similar companies) or GHI Corporation (and similar companies). In some cases, the control server 110 computes, for each employer in a set of employers, a similarity score to the prior employer selected by the user of the client device 130 (e.g., a similarity score to ABC Corporation). The similarity score is computed based on a number (e.g., in a professional networking service or other data repository) of current or former employees of the employer, a number of employees of the prior employer, and a number of common current or former employees of the employer and the prior employer. Ranking the set of job candidates based on the prior employer includes removing, from the set of job candidates, at least one job candidate who lacks a prior employer having at least a threshold similarity score.
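The description names the three counts that feed the similarity score but not the formula that combines them. The sketch below assumes a Jaccard-style ratio (common employees divided by the union of both employee sets); this formula, the function names, the counts, and the threshold are all illustrative assumptions.

```python
def similarity(n_employer, n_prior, n_common):
    """Jaccard-style score from three counts (an assumed formula)."""
    union = n_employer + n_prior - n_common
    return n_common / union if union else 0.0

def filter_candidates(candidates, scores, threshold):
    """Keep candidates with at least one prior employer scoring >= threshold."""
    return [
        name
        for name, employers in candidates.items()
        if any(scores.get(e, 0.0) >= threshold for e in employers)
    ]

# Hypothetical counts relative to the selected prior employer "ABC":
# (employees of the employer, employees of ABC, employees in common).
counts = {
    "ABC": (1000, 1000, 1000),
    "DEF": (800, 1000, 200),
    "GHI": (500, 1000, 10),
}
scores = {employer: similarity(*c) for employer, c in counts.items()}

# Hypothetical candidates and their prior employers.
candidates = {"cand1": ["DEF"], "cand2": ["GHI"], "cand3": ["GHI", "ABC"]}
kept = filter_candidates(candidates, scores, threshold=0.1)
```

With these numbers, the candidate whose only prior employer is GHI is removed, because GHI shares too few employees with ABC to reach the threshold.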

In some implementations, the similarity score is computed using machine learning. The machine learning takes into account weighted employment histories and a current employer of a first set of individuals. The machine learning further takes into account negative sampling of weighted employment histories of a second set of individuals. Each individual in the second set of individuals corresponds to a real individual's past employment history, but is assigned a fictitious current employer for the negative sampling. Negative sampling is discussed in more detail in conjunction with FIGS. 10-11, below.

In some implementations, the prompt includes a request for a user of the client device 130 to select an area of expertise from a set of areas of expertise. The ranking of the set of job candidates is based on the area of expertise selected by the user of the client device 130.

At operation 960, the control server 110 provides, for display at the client device 130, an output based on the ranked set of job candidates. For example, the output may include all of the job candidates in the set having a ranking that exceeds a threshold ranking value. In some cases, the N highest-ranked job candidates in the set are displayed, where N is a positive integer, for example, 5, 10, or 20.
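The two output options of operation 960 (a threshold cut and a top-N cut) can be sketched with a single helper. The function name `select_output`, the candidate names, and the scores are hypothetical, and the sketch assumes a higher score means a better ranking.

```python
def select_output(ranked, threshold=None, top_n=None):
    """Return candidate names, best first, after optional cuts.

    ranked is a list of (candidate, score) pairs in any order.
    """
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        # Keep only candidates whose ranking exceeds the threshold.
        ordered = [pair for pair in ordered if pair[1] > threshold]
    if top_n is not None:
        ordered = ordered[:top_n]
    return [name for name, _ in ordered]

# Hypothetical ranked candidates.
ranked = [("cand1", 0.4), ("cand2", 0.9), ("cand3", 0.7), ("cand4", 0.1)]
above_threshold = select_output(ranked, threshold=0.5)
top_three = select_output(ranked, top_n=3)
```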

In some cases, the control server 110 (or another machine) determines, for at least one job candidate from the set of job candidates, one or more areas of expertise based on one or more identified skills of the job candidate. Techniques for determining areas of expertise are described in conjunction with FIG. 6. The identified skill(s) may correspond to the title provided in block 620.

Machine learning may be used to predict data that can exist in the real world. Machine learning typically relies on providing positive true samples and negative false samples, and teaching the machine to distinguish between the positive and negative samples. Positive real-world data is relatively easy to obtain. (For example, in a machine learning algorithm that uses an individual's history of employers to predict a current employer, positive samples can be obtained from publicly shared data in a professional networking service.) However, obtaining negative samples (e.g., samples of individuals who did not work at a company or individuals who were rejected by the company) may be challenging. Some aspects of the technology described herein address this challenge. Some aspects of the sampling technique described herein make up negative examples. Some of the negative examples may be incorrect (e.g., if an individual received an offer from a company, but did not join, or if an individual worked at a company, but this is not indicated in the data available about the individual). However, the exploited characteristic is likely to be correct since the vast majority of people cannot or could not work at a random employment position from the entire set of possibilities.

FIG. 10 is a flow chart of a method 1000 for negative sampling. As described here, the method 1000 is implemented within the system 100 of FIG. 1. However, the method 1000 may be implemented at other machine(s) or in other system(s).

At operation 1010, the control server 110 accesses (e.g., from a data repository) a matrix. The matrix has rows representing entities (e.g., individuals) and columns representing features (e.g., employers). An example matrix is described in conjunction with FIG. 11. The accessed matrix may be generated based on data stored in a professional networking service.

At operation 1020, the control server 110 selects a specific subset of columns (which may include at least one column) in the matrix for randomization. The subset of columns may include, for example, a column that represents a current employer of the employee (in other words, a current feature of the entity or other specified feature of the entity). In some implementations, the entities are employees, and the features are current employers and former employers of the employees, and the specific column is associated with the current employers. In some implementations, the entities are high school (or other) students, the features are high school grades in a plurality of subjects, scores on a plurality of exams, and higher education institution (e.g., college) attended, and the specific column is associated with the higher education institution attended.

At operation 1030, the control server 110 partitions the matrix by example row into multiple submatrices.

At operation 1040, the control server 110 assigns the multiple submatrices to multiple computation servers 120. In some implementations, each submatrix is assigned to one computation server 120.k.

At operation 1050, each computation server 120.k shuffles the values in each column in the specific subset of columns among the rows of the submatrix assigned to the computation server 120.k. In some cases, the computation servers 120 shuffle the values in the specific column in parallel. The computation servers 120 provide the shuffled submatrices to the control server 110.

At operation 1060, the control server 110 merges the shuffled submatrices into a shuffled matrix. The control server 110 provides an output representing the shuffled matrix. It should be noted that the original accessed matrix may represent real world data, and the shuffled matrix may represent fictitious negative sampling data. In some cases, output representing the shuffled matrix is provided to a machine implementing a machine learning training algorithm for predicting values in the specific column. The shuffled matrix is used for negative examples for the machine learning training algorithm, and the accessed matrix is used for positive examples for the machine learning training algorithm.
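Operations 1030 through 1060 can be sketched in a single process as follows. The function names, the contiguous partitioning, the seed, and the sample rows are assumptions for illustration, and the per-partition loop stands in for work that the computation servers 120 would perform in parallel.

```python
import random

def partition_rows(matrix, num_parts):
    """Split rows into at most num_parts contiguous submatrices."""
    size = -(-len(matrix) // num_parts)  # ceiling division
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]

def shuffle_columns(submatrix, columns, rng):
    """Shuffle the selected columns among the rows of one submatrix."""
    shuffled = [list(row) for row in submatrix]  # leave the input untouched
    for c in columns:
        values = [row[c] for row in shuffled]
        rng.shuffle(values)
        for row, value in zip(shuffled, values):
            row[c] = value
    return shuffled

def negative_samples(matrix, columns, num_parts, seed=0):
    """Partition, shuffle each part, and merge back into one matrix."""
    rng = random.Random(seed)
    parts = partition_rows(matrix, num_parts)
    shuffled_parts = [shuffle_columns(part, columns, rng) for part in parts]
    return [row for part in shuffled_parts for row in part]

# Hypothetical rows: two past employers followed by the current employer.
real = [
    ["A", "B", "D"],
    ["B", "C", "F"],
    ["C", "D", "G"],
    ["A", "E", "H"],
]
fictitious = negative_samples(real, columns=[2], num_parts=2)
```

Because each partition is shuffled independently, a value never leaves its submatrix, which avoids data movement between machines. A random shuffle can also leave a value in its original row, so a small share of the generated rows may coincide with real ones.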

FIG. 11 is a schematic diagram 1100 of matrices that may be used in negative sampling. As shown, the diagram 1100 includes four matrices 1110, 1120, 1130, and 1140. Matrices 1110 and 1120 represent real world data and may be generated, for example, from publicly accessible data in a professional networking service. Matrices 1130 and 1140 represent fictitious negatively sampled data and may be generated, for example, using the negative sampling techniques described herein. As shown in the matrices of the diagram 1100, each row represents a person. However, in some cases, a person may be represented in multiple rows or multiple people may be represented in a single row.

Matrix 1110 has rows representing four data points, which are extracted from four individuals (Albert, Betsy, Carlos, and Diana) and columns representing eight corporations (A, B, C, D, E, F, G, and H). The values represent a weighted past employment history of the individuals, where the most recent past employer has twice the weight of the second most recent, which has twice the weight of the third most recent, etc. The sum of the weights for each individual is 1.

For example, Albert's most recent past employer is E. Before E, Albert was employed at C, and before C, Albert was employed at A. Thus, the weight for the Albert-E cell (4/7) is twice that for the Albert-C cell (2/7), which is twice that for the Albert-A cell (1/7). The weights for the other corporations (B, D, F, G, and H) are blank or zero because they are not Albert's past employers. It should be noted that the sum of all of the weights for Albert is 1.

Betsy's most recent past employer is D, and before that Betsy was employed at B. Thus, Betsy's weight for D (2/3) is twice that for B (1/3). Betsy's weights for the other corporations are blank or zero because they are not Betsy's past employers. It should be noted that the sum of all of the weights for Betsy is 1.

Carlos' most recent past employer is D. Before D, Carlos was employed at C, and before C, Carlos was employed at B. Thus, the weight for the Carlos-D cell (4/7) is twice that for the Carlos-C cell (2/7), which is twice that for the Carlos-B cell (1/7). The weights for the other corporations (A, E, F, G, and H) are blank or zero because they are not Carlos' past employers. It should be noted that the sum of all of the weights for Carlos is 1.

Diana's most recent past employer is E, and before that Diana was employed at A. Thus, Diana's weight for E (2/3) is twice that for A (1/3). Diana's weights for the other corporations are blank or zero because they are not Diana's past employers. It should be noted that the sum of all of the weights for Diana is 1.
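The weighting rule used in matrix 1110 (each employer weighted twice as heavily as the one before it, with the weights summing to 1) can be reproduced with exact fractions. The function name `employment_weights` is hypothetical, and the sketch assumes each employer appears at most once in a history.

```python
from fractions import Fraction

def employment_weights(past_employers):
    """Weights for a history ordered most recent first.

    With n employers the weights are 2^(n-1), ..., 2, 1 divided by their
    sum, 2^n - 1, so each is twice the next and they total 1.
    """
    n = len(past_employers)
    total = 2 ** n - 1
    return {
        employer: Fraction(2 ** (n - 1 - i), total)
        for i, employer in enumerate(past_employers)
    }

albert = employment_weights(["E", "C", "A"])  # E: 4/7, C: 2/7, A: 1/7
betsy = employment_weights(["D", "B"])        # D: 2/3, B: 1/3
```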

Matrix 1120 has the same rows (Albert, Betsy, Carlos, and Diana) and columns (A, B, C, D, E, F, G, and H) as matrix 1110. Matrix 1120 has a 1 in the cell representing the current employer of each individual, and a blank/0 in other cells in the individual's row. As shown, Albert's current employer is D, Betsy's current employer is F, Carlos' current employer is G, and Diana's current employer is H.

Matrices 1130 and 1140 represent fictitious negatively sampled data and may be generated, for example, using the negative sampling techniques described herein. As shown, the cells of matrix 1130 have the same values as those of matrix 1110. However, the individuals are labeled “Fictitious Albert,” “Fictitious Betsy,” “Fictitious Carlos,” and “Fictitious Diana,” in place of “Albert,” “Betsy,” “Carlos,” and “Diana.” In matrix 1140, the rows of the matrix 1120 are scrambled, such that the first row (representing Albert) of matrix 1120 becomes the fourth row (representing fictitious Diana) in matrix 1140, the second row (representing Betsy) of matrix 1120 becomes the third row (representing fictitious Carlos) in matrix 1140, the third row (representing Carlos) of matrix 1120 becomes the first row (representing fictitious Albert) in matrix 1140, and the fourth row (representing Diana) of matrix 1120 becomes the second row (representing fictitious Betsy) in matrix 1140. As a result, matrix 1140 indicates that fictitious Albert's current employer is G Corp., fictitious Betsy's current employer is H Corp., fictitious Carlos' current employer is F Corp., and fictitious Diana's current employer is D Corp. These are different from the real world current employers of matrix 1120.

Using the matrices of the schematic diagram 1100, a machine learning algorithm that tries to predict, for an individual, a current employer based on the individual's past employers may have real world positive samples and fictitious negative samples. In a scenario where there are thousands of employers, only a few of which are a good fit for an employee, negative sampling by randomly selecting a business to replace the “current employer” field may provide multiple negative samples for training the machine learning algorithm.

Modules, Components, and Logic

Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code embodied on a machine-readable medium) or hardware modules. A “hardware module” is a tangible unit capable of performing certain operations and may be configured or arranged in a certain physical manner. In various example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.

In some embodiments, a hardware module may be implemented mechanically, electronically, or any suitable combination thereof. For example, a hardware module may include dedicated circuitry or logic that is permanently configured to perform certain operations. For example, a hardware module may be a special-purpose processor, such as a Field-Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). A hardware module may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware module may include software executed by a general-purpose processor or other programmable processor. Once configured by such software, hardware modules become specific machines (or specific components of a machine) uniquely tailored to perform the configured functions and are no longer general-purpose processors. It may be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.

Accordingly, the phrase “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. As used herein, “hardware-implemented module” refers to a hardware module. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where a hardware module comprises a general-purpose processor configured by software to become a special-purpose processor, the general-purpose processor may be configured as respectively different special-purpose processors (e.g., comprising different hardware modules) at different times. Software accordingly configures a particular processor or processors, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.

Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules may be regarded as being communicatively coupled. Where multiple hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) between or among two or more of the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).

The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions described herein. As used herein, “processor-implemented module” refers to a hardware module implemented using one or more processors.

Similarly, the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented modules. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an API).

The performance of certain of the operations may be distributed among the processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the processors or processor-implemented modules may be distributed across a number of geographic locations.

Machine and Software Architecture

The modules, methods, applications, and so forth described in conjunction with FIGS. 1-10 are implemented in some embodiments in the context of a machine and an associated software architecture. The sections below describe representative software architecture(s) and machine (e.g., hardware) architecture(s) that are suitable for use with the disclosed embodiments.

Software architectures are used in conjunction with hardware architectures to create devices and machines tailored to particular purposes. For example, a particular hardware architecture coupled with a particular software architecture may create a mobile device, such as a mobile phone, tablet device, or so forth. A slightly different hardware and software architecture may yield a smart device for use in the “internet of things,” while yet another combination produces a server computer for use within a cloud computing architecture. Not all combinations of such software and hardware architectures are presented here, as those of skill in the art can readily understand how to implement the inventive subject matter in different contexts from the disclosure contained herein.

Example Machine Architecture and Machine-Readable Medium

FIG. 12 is a block diagram illustrating components of a machine 1200, according to some example embodiments, able to read instructions from a machine-readable medium (e.g., a machine-readable storage medium) and perform any one or more of the methodologies discussed herein. Specifically, FIG. 12 shows a diagrammatic representation of the machine 1200 in the example form of a computer system, within which instructions 1216 (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine 1200 to perform any one or more of the methodologies discussed herein may be executed. The instructions 1216 transform the general, non-programmed machine into a particular machine programmed to carry out the described and illustrated functions in the manner described. In alternative embodiments, the machine 1200 operates as a standalone device or may be coupled (e.g., networked) to other machines. In a networked deployment, the machine 1200 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 1200 may comprise, but not be limited to, a server computer, a client computer, a PC, a tablet computer, a laptop computer, a netbook, a set-top box (STB), a personal digital assistant (PDA), an entertainment media system, a cellular telephone, a smart phone, a mobile device, a wearable device (e.g., a smart watch), a smart home device (e.g., a smart appliance), other smart devices, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 1216, sequentially or otherwise, that specify actions to be taken by the machine 1200. Further, while only a single machine 1200 is illustrated, the term “machine” shall also be taken to include a collection of machines 1200 that individually or jointly execute the instructions 1216 to perform any one or more of the methodologies discussed herein.

The machine 1200 may include processors 1210, memory/storage 1230, and I/O components 1250, which may be configured to communicate with each other such as via a bus 1202. In an example embodiment, the processors 1210 (e.g., a Central Processing Unit (CPU), a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an ASIC, a Radio-Frequency Integrated Circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, a processor 1212 and a processor 1214 that may execute the instructions 1216. The term “processor” is intended to include multi-core processors that may comprise two or more independent processors (sometimes referred to as “cores”) that may execute instructions contemporaneously. Although FIG. 12 shows multiple processors 1210, the machine 1200 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.

The memory/storage 1230 may include a memory 1232, such as a main memory, or other memory storage, and a storage unit 1236, both accessible to the processors 1210 such as via the bus 1202. The storage unit 1236 and memory 1232 store the instructions 1216 embodying any one or more of the methodologies or functions described herein. The instructions 1216 may also reside, completely or partially, within the memory 1232, within the storage unit 1236, within at least one of the processors 1210 (e.g., within the processor's cache memory), or any suitable combination thereof, during execution thereof by the machine 1200. Accordingly, the memory 1232, the storage unit 1236, and the memory of the processors 1210 are examples of machine-readable media.

As used herein, “machine-readable medium” means a device able to store instructions (e.g., instructions 1216) and data temporarily or permanently and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., Electrically Erasable Programmable Read-Only Memory (EEPROM)), and/or any suitable combination thereof. The term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store the instructions 1216. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing instructions (e.g., instructions 1216) for execution by a machine (e.g., machine 1200), such that the instructions, when executed by one or more processors of the machine (e.g., processors 1210), cause the machine to perform any one or more of the methodologies described herein. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” excludes signals per se.

The I/O components 1250 may include a wide variety of components to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on. The specific I/O components 1250 that are included in a particular machine may depend on the type of machine. For example, portable machines such as mobile phones may likely include a touch input device or other such input mechanisms, while a headless server machine may likely not include such a touch input device. It is appreciated that the I/O components 1250 may include many other components that are not shown in FIG. 12. The I/O components 1250 are grouped according to functionality merely for simplifying the following discussion and the grouping is in no way limiting. In various example embodiments, the I/O components 1250 may include output components 1252 and input components 1254. The output components 1252 may include visual components (e.g., a display such as a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)), acoustic components (e.g., speakers), haptic components (e.g., a vibratory motor, resistance mechanisms), other signal generators, and so forth. The input components 1254 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, a photo-optical keyboard, or other alphanumeric input components), point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or another pointing instrument), tactile input components (e.g., a physical button, a touch screen that provides location and/or force of touches or touch gestures, or other tactile input components), audio input components (e.g., a microphone), and the like.

In further example embodiments, the I/O components 1250 may include biometric components 1256, motion components 1258, environmental components 1260, or position components 1262, among a wide array of other components. For example, the biometric components 1256 may include components to detect expressions (e.g., hand expressions, facial expressions, vocal expressions, body gestures, or eye tracking), measure biosignals (e.g., blood pressure, heart rate, body temperature, perspiration, or brain waves), identify a person (e.g., voice identification, retinal identification, facial identification, fingerprint identification, or electroencephalogram-based identification), and the like. The motion components 1258 may include acceleration sensor components (e.g., accelerometer), gravitation sensor components, rotation sensor components (e.g., gyroscope), and so forth. The environmental components 1260 may include, for example, illumination sensor components (e.g., photometer), temperature sensor components (e.g., one or more thermometers that detect ambient temperature), humidity sensor components, pressure sensor components (e.g., barometer), acoustic sensor components (e.g., one or more microphones that detect background noise), proximity sensor components (e.g., infrared sensors that detect nearby objects), gas sensors (e.g., gas detection sensors to detect concentrations of hazardous gases for safety or to measure pollutants in the atmosphere), or other components that may provide indications, measurements, or signals corresponding to a surrounding physical environment. The position components 1262 may include location sensor components (e.g., a Global Positioning System (GPS) receiver component), altitude sensor components (e.g., altimeters or barometers that detect air pressure from which altitude may be derived), orientation sensor components (e.g., magnetometers), and the like.

Communication may be implemented using a wide variety of technologies. The I/O components 1250 may include communication components 1264 operable to couple the machine 1200 to a network 1280 or devices 1270 via a coupling 1282 and a coupling 1272, respectively. For example, the communication components 1264 may include a network interface component or other suitable device to interface with the network 1280. In further examples, the communication components 1264 may include wired communication components, wireless communication components, cellular communication components, Near Field Communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components to provide communication via other modalities. The devices 1270 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via USB).

Moreover, the communication components 1264 may detect identifiers or include components operable to detect identifiers. For example, the communication components 1264 may include Radio Frequency Identification (RFID) tag reader components, NFC smart tag detection components, optical reader components (e.g., an optical sensor to detect one-dimensional bar codes such as Universal Product Code (UPC) bar code, multi-dimensional bar codes such as Quick Response (QR) code, Aztec code, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, UCC RSS-2D bar code, and other optical codes), or acoustic detection components (e.g., microphones to identify tagged audio signals). In addition, a variety of information may be derived via the communication components 1264, such as location via Internet Protocol (IP) geolocation, location via Wi-Fi® signal triangulation, location via detecting an NFC beacon signal that may indicate a particular location, and so forth.

Transmission Medium

In various example embodiments, one or more portions of the network 1280 may be an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), the Internet, a portion of the Internet, a portion of the Public Switched Telephone Network (PSTN), a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, another type of network, or a combination of two or more such networks. For example, the network 1280 or a portion of the network 1280 may include a wireless or cellular network and the coupling 1282 may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or another type of cellular or wireless coupling. In this example, the coupling 1282 may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1×RTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, Third Generation Partnership Project (3GPP) including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE) standard, others defined by various standard-setting organizations, other long range protocols, or other data transfer technology.

The instructions 1216 may be transmitted or received over the network 1280 using a transmission medium via a network interface device (e.g., a network interface component included in the communication components 1264) and utilizing any one of a number of well-known transfer protocols (e.g., HTTP). Similarly, the instructions 1216 may be transmitted or received using a transmission medium via the coupling 1272 (e.g., a peer-to-peer coupling) to the devices 1270. The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying the instructions 1216 for execution by the machine 1200, and includes digital or analog communications signals or other intangible media to facilitate communication of such software.

Language

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

Although an overview of the inventive subject matter has been described with reference to specific example embodiments, various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present disclosure. Such embodiments of the inventive subject matter may be referred to herein, individually or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single disclosure or inventive concept if more than one is, in fact, disclosed.

The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, modules, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

What is claimed is:
 1. A computer-implemented method comprising: at a first server computer, i) accessing a data structure stored in memory and representing a matrix having multiple dimensions, one dimension of the matrix representing features, and one dimension of the matrix representing data points, ii) processing the data structure to separate the matrix into multiple submatrices along a first dimension, each submatrix including all cells in the matrix for a set of values in the first dimension and iii) providing a single submatrix of the multiple submatrices to each of one machine of multiple machines; at each machine of the multiple machines, computing a correlation between values in at least one second dimension of the matrix and a value for a preselected feature in the matrix; at the first server computer calculating a combined correlation from the individual correlations computed by each machine of the multiple machines, the combined correlation for use in predicting a value for the preselected feature based on other values along the at least one second dimension; and at the first server computer, i) receiving a new submatrix including values along the at least one second dimension, and ii) predicting, using the combined correlation, a value for the preselected feature for the new submatrix.
 2. The method of claim 1, wherein computing the correlation between the at least one second dimension of the matrix and the feature value is based on a differentiable mathematical function applied to values in the at least one second dimension.
 3. The method of claim 1, wherein the feature dimension represents employers, wherein the at least one second dimension comprises a dimension representing individuals, and wherein a non-zero value in a cell of the matrix represents that the individual is a current or former employee of the employer.
 4. The method of claim 3, wherein the at least one second dimension further comprises a dimension representing areas of expertise and a dimension representing a current or former educational institution attended.
 5. The method of claim 1, wherein computing the correlation comprises computing, for at least one additional feature different from the preselected feature, a probability that, for a slice of the matrix along the at least one second dimension, the preselected feature has a non-zero value given that the at least one additional feature has a non-zero value.
 6. The method of claim 5, wherein the correlation is computed based on non-zero values in a slice along the at least one second dimension and the computed probability.
 7. A non-transitory computer-readable medium storing instructions which, when implemented by processing circuitry of one or more computers, cause the processing circuitry to perform operations comprising: accessing a matrix, the matrix having multiple dimensions, one dimension of the matrix representing features, and one dimension of the matrix representing data points; separating the matrix into multiple submatrices along a first dimension, each submatrix including all cells in the matrix for a set of values in the first dimension; providing the multiple submatrices to multiple machines; receiving from each machine of the multiple machines a computed correlation between values in at least one second dimension of the matrix and a value for a preselected feature in the matrix; computing a combined correlation from the individual correlations computed by each machine of the multiple machines, the combined correlation for use in predicting the value for the preselected feature based on other values along the at least one second dimension; and receiving a new submatrix including values along the at least one second dimension; and predicting, using the computed correlation, a value for the preselected feature for the new submatrix.
 8. The computer-readable medium of claim 7, wherein computing the correlation between the at least one second dimension of the matrix and the feature value is based on a differentiable mathematical function applied to values in the at least one second dimension.
 9. The computer-readable medium of claim 7, wherein the feature dimension represents employers, wherein the at least one second dimension comprises a dimension representing individuals, and wherein a non-zero value in a cell of the matrix represents that the individual is a current or former employee of the employer.
 10. The computer-readable medium of claim 9, wherein the at least one second dimension further comprises a dimension representing areas of expertise and a dimension representing a current or former educational institution attended.
 11. The computer-readable medium of claim 7, wherein computing the correlation comprises computing, for at least one additional feature different from the preselected feature, a probability that, for a slice of the matrix along the at least one second dimension, the preselected feature has a non-zero value given that the at least one additional feature has a non-zero value.
 12. The computer-readable medium of claim 11, wherein the correlation is computed based on non-zero values in a slice along the at least one second dimension and the computed probability.
 13. A computer-implemented method comprising: at a first server computer: accessing a matrix having multiple dimensions, one dimension of the matrix representing features, and one dimension of the matrix representing data points; processing the matrix to separate the matrix into multiple submatrices along a first dimension, each submatrix including all cells in the matrix for a set of values in the first dimension; providing a single submatrix of the multiple submatrices to each of one machine of multiple machines; receiving, from each machine of the multiple machines, a correlation between values in at least one second dimension of the matrix and a value for a preselected feature in the matrix; and using as input the correlation as received for each machine of the multiple machines, generating an output representing the computed combined correlation, the computed combined correlation for use in predicting values for the preselected feature based on other values along the at least one second dimension; receiving a new submatrix including values along the at least one second dimension, and predicting, using the computed combined correlation, a value for the preselected feature for the new submatrix.
 14. The method of claim 13, wherein the correlation between the at least one second dimension of the matrix and the feature value is represented using a differentiable mathematical function applied to values in the at least one second dimension.
 15. The method of claim 13, wherein the feature dimension represents employers, wherein the at least one second dimension comprises a dimension representing individuals, and wherein a non-zero value in a cell of the matrix represents that the individual is a current or former employee of the employer.
 16. The method of claim 15, wherein the at least one second dimension further comprises a dimension representing areas of expertise and a dimension representing a current or former educational institution attended.
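
By way of illustration only, and without limiting the claims, the sequence recited above (separating the matrix into submatrices, computing a per-machine correlation, combining the results, and predicting a value for the preselected feature) may be sketched in Python as follows. The sketch makes several assumptions that the disclosure does not state: the "machines" are simulated by ordinary function calls, each machine reports raw co-occurrence counts rather than a finished correlation, the combined correlation is the conditional probability described in claims 5 and 11, and the prediction is a simple average of those probabilities. All function names (split_rows, local_counts, combine, predict) are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows = data points (e.g., individuals), columns = features


def split_rows(matrix: Matrix, parts: int) -> List[Matrix]:
    """Separate the matrix along the data-point dimension.
    Each submatrix keeps every column for its set of rows."""
    size = max(1, -(-len(matrix) // parts))  # ceiling division, at least one row
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def local_counts(sub: Matrix, target: int) -> Tuple[List[int], List[int]]:
    """Per-machine step (assumed form): for each feature j other than the target,
    count rows where j is non-zero, and rows where j and the target are both non-zero."""
    n = len(sub[0])
    seen = [0] * n
    both = [0] * n
    for row in sub:
        for j, value in enumerate(row):
            if j != target and value != 0:
                seen[j] += 1
                if row[target] != 0:
                    both[j] += 1
    return seen, both


def combine(results: List[Tuple[List[int], List[int]]]) -> List[float]:
    """Server step: sum the counts from all machines and return, per feature j,
    an estimate of P(target non-zero | feature j non-zero)."""
    n = len(results[0][0])
    seen = [sum(r[0][j] for r in results) for j in range(n)]
    both = [sum(r[1][j] for r in results) for j in range(n)]
    return [both[j] / seen[j] if seen[j] else 0.0 for j in range(n)]


def predict(probs: List[float], row: List[int], target: int) -> float:
    """Assumed scoring rule: average the conditional probabilities of the
    features that are non-zero in the new row."""
    active = [probs[j] for j, value in enumerate(row) if j != target and value != 0]
    return sum(active) / len(active) if active else 0.0


# Hypothetical data: columns are employers A, B, C; column 2 (C) is the preselected feature.
matrix = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]
target = 2
submatrices = split_rows(matrix, parts=2)
results = [local_counts(sub, target) for sub in submatrices]  # one call per "machine"
probs = combine(results)
score = predict(probs, [1, 1, 0], target)
print(probs, score)
```

Because the machines in this sketch return counts rather than ratios, summing them at the server yields the same estimate as a single pass over the undivided matrix. Other ways of combining per-machine results are possible and are not excluded by this example.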